The promise and peril of abundance: Making Big Data small. BRENDAN MCADAMS at Big Data Spain 2012
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/the-promise-and-peril-of-abundance-making-big-data-small/brendan-mcadams

    Presentation Transcript

    • A Modest Proposal for Taming and Clarifying the Promises of Big Data and the Software-Driven Future. Brendan McAdams, 10gen, Inc. brendan@10gen.com, @rit
    • "In short, software is eating the world." - Marc Andreesen Wall Street Journal, Aug. 2011 http://on.wsj.com/XLwnmoFriday, November 16, 12
    • Software is Eating the World
      • Amazon.com (and .uk, .es, etc.) started as a bookstore
      • Today, they sell just about everything: bicycles, appliances, computers, TVs, etc.
      • In some American cities, they even do home grocery delivery
      • No longer primarily a physical-goods company; increasingly defined by software
      • Pioneering the eBook revolution with Kindle
      • EC2 runs a huge percentage of the public internet
    • Software is Eating the World
      • Netflix started as a company to deliver DVDs to the home...
      • But as they've grown, the business has shifted to an online streaming service
      • They are now rolling out rapidly in many countries, including Ireland, the UK, Canada and the Nordics
      • No need for physical inventory or postal distribution... just servers and digital copies
    • Disney Found Itself Forced To Transform... From This... ...To This
    • But What Does All This Software Do?
      • Software always eats data, be it text files, user form input, emails, etc.
      • All things that eat must eventually excrete...
    • Ingestion = Excretion: yeast ingests sugars and excretes ethanol
    • Ingestion = Excretion: cows, er... well, you get the point.
    • So What Does Software Eat?
      • Software always eats data, be it text files, user form input, emails, etc.
      • But what does software excrete? More data, of course...
      • This data gets bigger and bigger
      • The solutions for storing & processing this data become narrower
      • Data fertilizes software, in an endless cycle...
    • There’s a Big Market Here...
      • Lots of solutions for Big Data
      • Data warehouse software
      • Operational databases
      • Old-style systems being upgraded to scale storage + processing
      • NoSQL: Cassandra, MongoDB, etc.
      • Platforms: Hadoop
    • Don’t Tilt At Windmills...
      • It is easy to get distracted by all of these solutions
      • Keep it simple
      • Use tools you (and your team) can understand
      • Use tools and techniques that can scale
      • Try not to reinvent the wheel
    • ...And Don’t Bite Off More Than You Can Chew
      • Break it into smaller pieces
      • You can’t fit a whole pig into your mouth...
      • ...slice it into small parts that you can consume.
    • Big Data at a Glance (the running example: a large dataset with “username” as its primary key)
      • Big Data can be gigabytes, terabytes, petabytes or exabytes
      • An ideal big data system scales up and down across various data sizes while providing a uniform view
      • Major concerns:
      • Can I read & write this data efficiently at different scales?
      • Can I run calculations on large portions of this data?
    • Big Data at a Glance...
      • Systems like Google File System (which inspired Hadoop’s HDFS) and MongoDB’s sharding handle the scale problem by chunking
      • Break data up into smaller chunks, spread across many data nodes
      • Each data node contains many chunks
      • If a chunk gets too large or a node becomes overloaded, data can be rebalanced
    • Chunks Represent Ranges of Values
      • Initially, an empty collection has a single chunk covering the range from minimum (-∞) to maximum (+∞)
      • As we add data (INSERT {USERNAME: “Bill”}), more chunks are created over new ranges, e.g. -∞ → “B”, “B” → “C”, “C” → +∞
      • After further inserts (INSERT {USERNAME: “Becky”}, INSERT {USERNAME: “Brendan”}), individual or partial-letter ranges such as -∞ → “Ba”, “Ba” → “Be”, “Be” → “Br” are possible chunk boundaries... but chunks can get smaller!
      • After INSERT {USERNAME: “Brad”}, the smallest possible chunk is not a range at all but a single value, e.g. “Brad” or “Brendan” (see the sketch below)
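To make the splitting idea above concrete, here is a minimal Python sketch of range-based chunking. The names (Chunk, MAX_CHUNK_DOCS) and the split-at-the-median-key rule are illustrative assumptions for this example only, not MongoDB's actual sharding internals.

```python
# Toy illustration of range-based chunk splitting, as the slide describes.
# MAX_CHUNK_DOCS is deliberately tiny so splits are visible.

MIN_KEY, MAX_KEY = "-inf", "+inf"   # stand-ins for -infinity / +infinity
MAX_CHUNK_DOCS = 2

class Chunk:
    def __init__(self, low, high):
        self.low, self.high = low, high   # half-open range [low, high)
        self.keys = []

    def contains(self, key):
        return ((self.low == MIN_KEY or key >= self.low) and
                (self.high == MAX_KEY or key < self.high))

chunks = [Chunk(MIN_KEY, MAX_KEY)]        # an empty collection: one chunk

def insert(username):
    chunk = next(c for c in chunks if c.contains(username))
    chunk.keys.append(username)
    if len(chunk.keys) > MAX_CHUNK_DOCS:  # chunk too large: split it
        chunk.keys.sort()
        mid = chunk.keys[len(chunk.keys) // 2]       # median key as split point
        left, right = Chunk(chunk.low, mid), Chunk(mid, chunk.high)
        for k in chunk.keys:
            (left if k < mid else right).keys.append(k)
        i = chunks.index(chunk)
        chunks[i:i + 1] = [left, right]

for name in ["Bill", "Becky", "Brendan", "Brad"]:
    insert(name)

for c in chunks:
    print(f"[{c.low}, {c.high}) -> {sorted(c.keys)}")
```

Running this, the initial (-∞, +∞) chunk ends up split into narrower and narrower ranges as usernames are inserted, mirroring the progression shown on the slide.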
    • Big Data at a Glance
      • To simplify things, let’s look at our dataset split into chunks by letter (a through z)
      • Each chunk is represented by a single letter marking its contents
      • You could think of “B” as really being “Ba” → “Bz”
    • Big Data at a Glance
      • MongoDB sharding (as well as HDFS) breaks data into chunks (~64 MB)
    • Big Data at a Glance
      • With four data nodes, each node holds roughly 25% of the chunks
      • Representing data as chunks allows many levels of scale across n data nodes
    • Scaling
      • The set of chunks can be evenly distributed across n data nodes
    • Add Nodes: Chunk Rebalancing
      • The goal is equilibrium: an equal distribution
      • As nodes are added (or even removed), chunks can be redistributed for balance (see the sketch below)
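The rebalancing described above can be sketched with a simple greedy policy: repeatedly move a chunk from the fullest node to the emptiest one until counts are within one of each other. This policy and the node/chunk names are assumptions for illustration, not MongoDB's actual balancer algorithm.

```python
# Greedy chunk rebalancing sketch: even out chunk counts across nodes.

def rebalance(nodes):
    """nodes: dict of node name -> list of chunk ids (mutated in place)."""
    while True:
        fullest = max(nodes, key=lambda n: len(nodes[n]))
        emptiest = min(nodes, key=lambda n: len(nodes[n]))
        if len(nodes[fullest]) - len(nodes[emptiest]) <= 1:
            return  # within one chunk of equilibrium
        nodes[emptiest].append(nodes[fullest].pop())

# Four nodes each holding 4 chunks; then a fifth, empty node joins.
nodes = {f"node{i}": [f"chunk{i}{j}" for j in range(4)] for i in range(1, 5)}
nodes["node5"] = []
rebalance(nodes)
for name, chunks in nodes.items():
    print(name, len(chunks), chunks)
```

After the new node joins, the loop migrates chunks until every node holds three or four of the sixteen chunks, i.e. an (approximately) equal distribution.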
    • Don’t Bite Off More Than You Can Chew...
      • The answer to calculating big data is much the same as storing it
      • We need to break our data into bite-sized pieces
      • Build functions which can be composed and applied repeatedly to partitions of our data
      • Process portions of the data across multiple calculation nodes
      • Aggregate the results into a final set of results
    • Bite-Sized Pieces Are Easier to Swallow
      • These pieces are not chunks; rather, they are the individual data points that make up each chunk
      • Chunks also make useful data-transfer units for processing
      • Transfer chunks as “input splits” to calculation nodes, allowing for scalable parallel processing
    • MapReduce the Pieces
      • The most common application of these techniques is MapReduce
      • Based on a Google whitepaper, it works with two primary functions, map and reduce, to run calculations against large datasets
    • MapReduce to Calculate Big Data
      • MapReduce is designed to effectively process data at varying scales
      • Composable function units can be reused repeatedly for scaled results
    • MapReduce to Calculate Big Data
      • In addition to the HDFS storage component, Hadoop is built around MapReduce for calculation
      • MongoDB can be integrated with Hadoop to MapReduce data
      • No HDFS storage is needed: data moves directly between MongoDB and Hadoop’s MapReduce engine
    • What is MapReduce?
      • MapReduce is made up of a series of phases, the primary ones being map, shuffle and reduce
      • Let’s look at a typical MapReduce job: email records, counting the number of times a particular user has received email
    • MapReducing Email. The sample input records:
      • to: tyler, from: brendan, subject: Ruby Support
      • to: brendan, from: tyler, subject: Re: Ruby Support
      • to: mike, from: brendan, subject: Node Support
      • to: brendan, from: mike, subject: Re: Node Support
      • to: mike, from: tyler, subject: COBOL Support
      • to: tyler, from: mike, subject: Re: COBOL Support (WTF?)
    • Map Step
      • The map function breaks each document into a key (the grouping field) and a value, then calls emit(k, v)
      • For the sample emails, keyed by recipient, it emits: tyler → {count: 1}, brendan → {count: 1}, tyler → {count: 1}, mike → {count: 1}, brendan → {count: 1}, mike → {count: 1} (a sketch follows)
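A minimal sketch of the map step in Python, assuming the email documents are plain dicts keyed by recipient; map_email and the {count: 1} value shape are illustrative, not any particular framework's API.

```python
# Map step sketch: turn one email document into a (key, value) pair.

def map_email(doc):
    """Emit one (key, value) pair per email: who received it, count 1."""
    yield doc["to"], {"count": 1}

doc = {"to": "tyler", "from": "brendan", "subject": "Ruby Support"}
for key, value in map_email(doc):
    print(key, value)   # tyler {'count': 1}
```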
    • Group/Shuffle Step
      • Group like keys together, creating an array of their values (done automatically by MapReduce frameworks; sketched below)
      • tyler → [{count: 1}, {count: 1}]
      • mike → [{count: 1}, {count: 1}]
      • brendan → [{count: 1}, {count: 1}]
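The grouping that real MapReduce frameworks perform automatically can be sketched in a few lines of Python; shuffle here is just an illustrative name.

```python
# Shuffle/group step sketch: collect all values emitted for the same key.

from collections import defaultdict

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

pairs = [("tyler", {"count": 1}), ("brendan", {"count": 1}),
         ("tyler", {"count": 1}), ("mike", {"count": 1}),
         ("brendan", {"count": 1}), ("mike", {"count": 1})]
for key, values in shuffle(pairs).items():
    print(key, values)   # e.g. tyler [{'count': 1}, {'count': 1}]
```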
    • Reduce Step
      • For each key, the reduce function flattens the list of values to a single result (aggregate the values, return the result)
      • tyler: [{count: 1}, {count: 1}] → {count: 2}
      • mike: [{count: 1}, {count: 1}] → {count: 2}
      • brendan: [{count: 1}, {count: 1}] → {count: 2}
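Putting the three steps together, here is a self-contained Python sketch of the whole email-count job from the preceding slides. Function names and document shapes are assumptions for illustration; MongoDB and Hadoop each expose their own MapReduce APIs.

```python
# End-to-end sketch: map each email to (recipient, {"count": 1}),
# group by key (shuffle), then reduce each group to a single total.

from collections import defaultdict

emails = [
    {"to": "tyler",   "from": "brendan", "subject": "Ruby Support"},
    {"to": "brendan", "from": "tyler",   "subject": "Re: Ruby Support"},
    {"to": "mike",    "from": "brendan", "subject": "Node Support"},
    {"to": "brendan", "from": "mike",    "subject": "Re: Node Support"},
    {"to": "mike",    "from": "tyler",   "subject": "COBOL Support"},
    {"to": "tyler",   "from": "mike",    "subject": "Re: COBOL Support (WTF?)"},
]

def map_email(doc):
    yield doc["to"], {"count": 1}

def reduce_counts(key, values):
    return {"count": sum(v["count"] for v in values)}

# map + shuffle phases
grouped = defaultdict(list)
for doc in emails:
    for key, value in map_email(doc):
        grouped[key].append(value)

# reduce phase
results = {key: reduce_counts(key, values) for key, values in grouped.items()}
print(results)  # {'tyler': {'count': 2}, 'brendan': {'count': 2}, 'mike': {'count': 2}}
```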
    • Processing Scalable Big Data
      • MapReduce provides an effective system for calculating and processing our large datasets (from gigabytes through exabytes and beyond)
      • MapReduce is supported in many places, including MongoDB & Hadoop
      • We now have effective answers for both of our concerns:
      • Can I read & write this data efficiently at different scales?
      • Can I run calculations on large portions of this data?
    • Batch Isn’t a Sustainable Answer
      • There are downsides here: fundamentally, MapReduce is a batch process
      • Batch systems like Hadoop give us a Catch-22
      • You can get answers to questions from petabytes of data
      • But you can’t guarantee you’ll get them quickly
      • In some ways, this is a step backwards for our industry
      • Business stakeholders tend to want answers now
      • We must evolve
    • Moving Away from Batch
      • The Big Data world is moving rapidly away from slow, batch-based processing solutions
      • Google has moved from batch toward more real-time processing over the last few years
      • Hadoop is replacing “MapReduce as assembly language” with more flexible resource management in YARN
      • Now MapReduce is just one feature implemented on top of YARN; you can build anything you want
      • Newer systems like Spark & Storm provide platforms for real-time processing
    • In Closing
      • The world IS being eaten by software
      • All that software is leaving behind an awful lot of data
      • We must be careful not to “step in it”
      • More data means more software means more data means...
      • Practical solutions for processing & storing data will save us
      • We as data scientists & technologists must always evolve our strategies, thinking and tools
    • [Download the Hadoop Connector] http://github.com/mongodb/mongo-hadoop [Docs] http://api.mongodb.org/hadoop/ ¿QUESTIONS? *Contact Me* brendan@10gen.com (twitter: @rit)