Big Data - JAX2011 (Pavlo Baron)
Slides of my Big Data talk at JAX2011 in Mainz, May 2011. For the demo code and a detailed description of the demo, a slide with URLs is included.


    Big Data - JAX2011 (Pavlo Baron) Presentation Transcript

    • Big Data Pavlo Baron
    • Pavlo Baron http://www.pbit.org [email_address] @pavlobaron
    • So, you think you have Big Data?
    • Don’t think
    • Know
    • Know your data
    • Know your data your scenarios
    • Know your data your scenarios how to scale
    • Know your data your scenarios how to scale the technology
    • Know your data your scenarios how to scale the technology when to stop
    • Where does your data actually come from?
    • Do you have a million well-structured records?
    • Or a couple of gigabytes of storage?
    • Does your data get modified every now and then?
    • Do you look at your data once a month to create a management report?
    • Or is your data an unstructured chaos?
    • Do you get flooded by tera-/petabytes of data?
    • Or do you simply get bombed with data?
    • Does your data flow in streams at a very high rate from different locations?
    • Or do you have to read The Matrix?
    • Do you need to distribute your data over the whole world?
    • Or does your existence depend on (the quality of) your data?
    • Know your data your scenarios how to scale the technology when to stop
    • Is it the storage that you need to focus on?
    • Or are you more preparing data?
    • Or do you have your customers spread all over the world?
    • Or do you have complex statistical analysis to do?
    • Or do you have to filter data as it comes?
    • Or is it necessary to visualize the data?
    • Know your data your scenarios how to scale the technology when to stop
    • To scale for Big Data means to...
    • Chop into smaller pieces
    • Chop into bite-size, manageable pieces
    • Separate reading from writing
    • Minimize hard relations
    • Separate archive from accessible data
    • Trash everything that has only to be analyzed in real-time
    • Parallelize and distribute
    • Strive for spatial proximity to processors
    • Utilize commodity hardware
    • Streamline the startup procedure for new hardware
    • Consider hardware fallibility
    • Strive for spatial proximity to users
    • Consider network unreliability
    • Design with eventual consistency in mind
    • Design with Byzantine faults in mind
    • Treat latency as a tuning knob
    • Treat availability as a tuning knob
    • Be prepared for disaster
    • Utilize the fog/clouds
    • Design for a theoretically unlimited amount of data
    • Design for frequent structure changes
    • Design for the all-in-one mix
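As a toy illustration of "chop into pieces" and "parallelize and distribute" (the chunk size, the data and the per-chunk work function are invented for this sketch; a real system fans the chunks out across machines, not threads):

```python
# Split a batch of records into fixed-size chunks and fan them out to
# workers; partial results are combined at the end. Chunk size and the
# per-chunk work are made-up example values.
from concurrent.futures import ThreadPoolExecutor

def chunked(records, size):
    """Yield successive fixed-size slices of the record list."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

def process_chunk(chunk):
    # Stand-in for real per-chunk work (parsing, aggregation, ...)
    return sum(chunk)

records = list(range(10_000))
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_chunk, chunked(records, 1_000)))

total = sum(partials)  # same result as processing everything in one go
```

The point of the sketch is that the combine step only sees small partial results, so the workers never need to share state while processing.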
    • Know your data your scenarios how to scale the technology when to stop
    • It’s no longer sufficient to just throw your data at an Oracle DB and hope it works
    • You have no chance without science
    • To manage Big Data means to learn/know
    • To manage Big Data means to learn/know algorithms/ADTs
    • To manage Big Data means to learn/know algorithms/ADTs computing systems
    • To manage Big Data means to learn/know algorithms/ADTs computing systems networking
    • To manage Big Data means to learn/know algorithms/ADTs computing systems networking operating systems
    • To manage Big Data means to learn/know algorithms/ADTs computing systems networking operating systems database systems
    • To manage Big Data means to learn/know algorithms/ADTs computing systems networking operating systems database systems distributed systems
    • So, you know your data. You know your scenarios. You know the theory
    • Now pick the right tools for the job
    • I have thousands of log records per second. I want to store them immediately, yet reliably, for later statistics. How would I do that?
    • Consider DHTs, P2P systems, distributed data stores etc.
    • In order to write fast, distribute to several nodes with a sloppy, non-durable write quorum
    • Build upon a system implementing consistent hashing. Don’t try home-made sharding as a distribution replacement - you will fail when adding new nodes
    • Try Riak. It’s derived from Amazon’s Dynamo. It implements consistent hashing, a gossip architecture, hinted handoff, vector clocks, Merkle trees, sloppy quorum etc.
    • I need to analyze these records in real-time for patterns and send alerts when any of them match. How would I do that?
    • Consider CEP – complex event processing with an EPL – event processing language
    • Choose a sliding time window just big enough to recognize causality and fire events on thresholds, deviations, anomalies etc.
    • Try Esper. It implements CEP/EPL
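The sliding-window idea can be sketched without a CEP engine; this toy detector (the window length and threshold are invented example values, not Esper's EPL) fires when too many events fall inside the time window:

```python
# Minimal sliding-time-window alerting sketch: keep timestamps of recent
# events, drop those that have slid out of the window, alert on count.
from collections import deque

class SlidingWindowAlert:
    def __init__(self, window_seconds, threshold):
        self.window = window_seconds
        self.threshold = threshold
        self.events = deque()  # timestamps of matching events

    def on_event(self, timestamp: float) -> bool:
        """Record an event; return True if the rate threshold is exceeded."""
        self.events.append(timestamp)
        # Evict events that are outside the sliding time window
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()
        return len(self.events) > self.threshold

detector = SlidingWindowAlert(window_seconds=10, threshold=3)
alerts = [detector.on_event(t) for t in [0, 1, 2, 3, 20]]
# four events inside the 10 s window trigger an alert; the lone
# event at t=20 does not
```

An EPL engine expresses the same rule declaratively and handles many such windows at once; the sketch only shows the causality-window mechanic.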
    • I collect my data at different locations all over the world, but want to do statistical analysis in my headquarters or at one other location. How would I do that?
    • Aggregate your data and push it out to the cloud, e.g. once a day. That’s a sort of replication, if you like
    • Choose a cloud-based data store that can store big objects. The store should provide consistency characteristics similar to those of your local data store
    • Try AWS (S3), Rackspace (OpenStack/Swift) or a private cloud. They are either directly Dynamo-based or implement similar concepts
    • That’s a lot of data and distribution. I need to quickly push it from a location into the cloud while data keeps coming in. How would I do that?
    • Use MapReduce to distribute the aggregation job to a group of nodes in order to quickly get the overall aggregation and cloud storage done
    • Map, sort, combine and reduce to whatever representation you need
    • Separate MapReduce splitting, jobs and intermediate storage from the local store to keep them independent, so you can read local store snapshots while still writing new data
    • Try Hadoop. It implements MapReduce with its own file system (HDFS), distribution etc. It is highly extensible
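The map, sort/shuffle, reduce flow can be sketched in-memory (the vote records are invented, demo-style data; Hadoop runs the same phases distributed across nodes and HDFS):

```python
# Minimal in-memory MapReduce: mapper emits (key, value) pairs, the
# sort stands in for the shuffle phase, and the reducer folds each
# key's values into one result.
from itertools import groupby
from operator import itemgetter

def map_vote(record):
    # Emit one count per (region, candidate)
    yield (record["region"], record["candidate"]), 1

def reduce_votes(key, counts):
    return key, sum(counts)

def map_reduce(records, mapper, reducer):
    pairs = [kv for rec in records for kv in mapper(rec)]
    pairs.sort(key=itemgetter(0))  # shuffle/sort: group equal keys together
    return dict(
        reducer(key, (value for _, value in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    )

votes = [
    {"region": "EUR", "candidate": "A"},
    {"region": "EUR", "candidate": "A"},
    {"region": "US", "candidate": "B"},
]
totals = map_reduce(votes, map_vote, reduce_votes)
# totals == {("EUR", "A"): 2, ("US", "B"): 1}
```

The separation matters because mapper and reducer see only their own inputs, which is exactly what lets Hadoop scatter them over a cluster.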
    • I need to do some statistical analysis and visualize my data. How would I do that?
    • Choose a general purpose platform for statistical computing and graphics
    • Try R. It allows statistical analysis of any type of data and its graphical plotting. It’s highly extensible
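R is the tool recommended here; purely to illustrate the kind of analysis meant, a toy outlier check (the latency samples and the two-standard-deviations rule are invented for this sketch):

```python
# Flag values far from the mean - a crude stand-in for the statistical
# analysis a platform like R would do properly.
from statistics import mean, stdev

samples = [12.1, 11.8, 12.4, 11.9, 12.0, 95.0, 12.2]
mu, sigma = mean(samples), stdev(samples)

# Flag values more than two standard deviations from the mean
outliers = [x for x in samples if abs(x - mu) > 2 * sigma]
```

In R the same check is a one-liner over a vector, and plotting the distribution is equally direct, which is why a dedicated statistics platform pays off quickly.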
    • And these are only some of the possible Q&As. There are more areas, such as NoSQL, content preparation for CDNs, data mining etc., which we didn’t consider
    • Know your data your scenarios how to scale the technology when to stop
    • The experiment – live demo. Source code will be available on http://github.com/pavlobaron and a detailed description on http://archi-jab-ture.blogspot.com
    • Situation (diagram): Data Center (US), Data Center (EUR), Data Center (AFR); Votes, Votes, Reports
    • Demo stack (diagram): SoapUI simulation, HTTP server, Esper (alert), Riak, object storage, Hadoop, R
    • We store votes as they come, with a sloppy write quorum. We store them on several nodes in a regional cluster. We match patterns on the stream, not on saved data. We push aggregated day data from the regions to the cloud using distributed MapReduce. We use scalable, distributed components with HA options. Etc. Does it scale?
    • Thank you
    • Most images originate from istockphoto.com, except a few taken from Wikipedia and product pages