
Introduction to Big Data



Introduction to Big Data: how we got here, and why we can't avoid the topic anymore.

Published in: Technology


  1. Introduction to Big Data
  2. Definition of Big Data
       • "Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate."
       • "Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information."
       • Data is growing much faster than computation speeds
       • A single machine can no longer process or even store all this data: the Big Data problem
  3. Where does Big Data come from?
       • Online recorded content:
           • Clicks
           • Ad views
           • Server requests
           • ... everything that happens online can potentially be recorded
       • User-generated content (Facebook, Twitter, Instagram, etc.)
           • Smartphone users reach for their phone 150 times a day (2013)
       • Health and scientific computing
           • The Large Hadron Collider produces about twice as much data as Twitter every year
       • Internet of Things (IoT)
           • smart thermostat systems
           • automobiles with built-in sensors
           • all kinds of “smart” devices of various sizes
  4. Example scales of Big Data
       • EIR communication logs: 1.4 TB / day
       • Facebook logs: 60 TB / day
       • Google total web index: ~10+ PB (10,000 TB)
       • Facebook total data: 300 PB with an incoming rate of 600 TB / day (2014)
       • A reminder:
           • time to read 1 TB from disk: 3 hours (at 100 MB/s)
           • the Google web index could be read from disk serially in ~3.4 years
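
The read-time figures on this slide can be checked with a few lines of arithmetic. This is a rough sketch that assumes decimal units (1 TB = 10**12 bytes) and a sustained 100 MB/s sequential read; the function and variable names are made up for illustration.

```python
# Back-of-the-envelope read times.
# Assumptions: decimal units and a sustained sequential read rate of 100 MB/s.

SECONDS_PER_HOUR = 3600
HOURS_PER_YEAR = 24 * 365

TB = 10**12
PB = 10**15


def read_hours(size_bytes: float, rate_bytes_per_s: float = 100e6) -> float:
    """Hours needed to read size_bytes sequentially at the given rate."""
    return size_bytes / rate_bytes_per_s / SECONDS_PER_HOUR


# 1 TB at 100 MB/s: a little under 3 hours, which the slide rounds to 3.
hours_per_tb = read_hours(1 * TB)

# A 10 PB index read serially at the exact rate: a bit over 3 years.
index_years = read_hours(10 * PB) / HOURS_PER_YEAR

# Using the slide's rounded 3 hours per TB for 10,000 TB gives about 3.4 years.
index_years_rounded = 10_000 * 3 / HOURS_PER_YEAR

if __name__ == "__main__":
    print(f"1 TB: {hours_per_tb:.2f} h")
    print(f"10 PB: {index_years:.2f} years (exact rate)")
    print(f"10 PB: {index_years_rounded:.2f} years (3 h per TB)")
```

The small gap between the two yearly figures comes only from rounding 2.8 hours up to 3; either way, a single disk is far too slow at this scale.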
  5. How do we program this thing?
  6. OK, but I don’t work at Google yet ...
  7. Startup example
       • Let’s design a simple web tracker from scratch
       • Register and count each page view for a number of clients
       • “Keep simple things simple”
       • Version 1.0:
       • Problem?
           • Huge number of page views => massive DB load on concurrent updates => DB timeouts => FAIL
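
One way to picture Version 1.0, assuming it performs one database write per page view (the transcript does not spell out the design). The sketch uses an in-memory SQLite database as a stand-in for the real DB; the `page_views` table and the function names are hypothetical.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the tracker database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page_views ("
        " client TEXT, url TEXT, views INTEGER,"
        " PRIMARY KEY (client, url))"
    )
    return conn


def record_page_view(conn: sqlite3.Connection, client: str, url: str) -> None:
    """Assumed Version 1.0 behaviour: one write transaction for every page view."""
    cur = conn.execute(
        "UPDATE page_views SET views = views + 1 WHERE client = ? AND url = ?",
        (client, url),
    )
    if cur.rowcount == 0:
        # First view of this page for this client.
        conn.execute(
            "INSERT INTO page_views (client, url, views) VALUES (?, ?, 1)",
            (client, url),
        )
    conn.commit()


def view_count(conn: sqlite3.Connection, client: str, url: str) -> int:
    """Read back the current counter, or 0 if the page was never viewed."""
    row = conn.execute(
        "SELECT views FROM page_views WHERE client = ? AND url = ?",
        (client, url),
    ).fetchone()
    return row[0] if row else 0
```

Every call to `record_page_view` hits the same row for a popular page, so concurrent requests all contend for it. That is the load pattern the slide blames for the timeouts.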
  8. Version 2.0
       • Why write each count?!
       • Let’s introduce a queue and buffer updates
       • Problem?
           • # of page views and # of clients keep increasing => DB overload => FAIL
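
A minimal sketch of the buffering idea behind Version 2.0. It assumes an in-process buffer rather than a real message queue, and a `write_batch` callback standing in for the database write; `BufferedTracker` and `flush_every` are illustrative names, not taken from the slides.

```python
from collections import Counter
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]  # (client, url)


class BufferedTracker:
    """Accumulate page views in memory and write them out in batches."""

    def __init__(
        self,
        write_batch: Callable[[Dict[Key, int]], None],
        flush_every: int = 1000,
    ) -> None:
        self._write_batch = write_batch  # hypothetical DB write, one call per batch
        self._flush_every = flush_every
        self._pending: Counter = Counter()
        self._buffered = 0

    def record_page_view(self, client: str, url: str) -> None:
        # Only touch memory here; the database is not involved per view.
        self._pending[(client, url)] += 1
        self._buffered += 1
        if self._buffered >= self._flush_every:
            self.flush()

    def flush(self) -> None:
        """Send the aggregated increments to the database in one call."""
        if self._pending:
            self._write_batch(dict(self._pending))
        self._pending.clear()
        self._buffered = 0
```

Buffering cuts the number of writes by roughly a factor of `flush_every`, but the total load still grows with traffic and with the number of clients, which is the problem the slide points out.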
  9. Version 3.0
       • The bottleneck is the write-heavy DB
       • Let’s shard the database!
       • Problems?!
           • Have to keep adding new servers and re-sharding existing databases
           • Re-sharding online is tricky (maybe introduce pending queues?)
           • A single code failure corrupts a huge set of data collected over years
           • Maintenance nightmare
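
To see why re-sharding hurts, here is a small sketch that assumes keys are placed by hash modulo the number of shards. The deck does not say which placement scheme it has in mind, so treat this as one common choice; `shard_for` and `moved_fraction` are hypothetical helpers.

```python
import hashlib
from typing import Sequence


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard by hashing the key (assumed scheme: hash modulo shard count)."""
    # hashlib gives a hash that is stable across runs, unlike built-in hash() on str.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(keys: Sequence[str], old_shards: int, new_shards: int) -> float:
    """Fraction of keys that land on a different shard after the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


if __name__ == "__main__":
    clients = [f"client-{i}" for i in range(10_000)]
    # Growing from 4 to 5 shards: a key keeps its shard only when
    # h % 4 == h % 5, which holds for 4 of every 20 hash values,
    # so about 80% of the keys are expected to move.
    print(f"moved: {moved_fraction(clients, 4, 5):.1%}")
```

Under this scheme, adding a single server relocates most of the data, and doing that while writes keep arriving is what makes online re-sharding tricky.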
  10. Is there a way out?
       • We need new tools which handle:
           • automatic sharding and re-sharding
           • automatic replication and rebalancing
           • fault tolerance
           • effortless horizontal scaling
       • But we need to adapt ourselves as well. We need:
           • a new definition of “data” (data ≠ information)
           • new architectures (Lambda Architecture)
           • immutable data (for scaling and fault tolerance)
           • functional programming concepts
       • No, writing 25-year-old structural code in this year’s favorite language won’t cut it anymore
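
The "immutable data" and "data ≠ information" points can be illustrated with a tiny sketch: raw page-view events are stored as immutable facts, and counts are derived from them by a pure function. This is one reading of the slide, not the author's design; `PageView`, `append` and `views_per_url` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PageView:
    """A raw fact ("data"): immutable once recorded."""
    client: str
    url: str
    timestamp: int


Log = Tuple[PageView, ...]


def append(log: Log, event: PageView) -> Log:
    """Return a new log with the event added; the old log is left untouched."""
    return log + (event,)


def views_per_url(log: Log, client: str) -> Counter:
    """Derived "information": recomputed from the raw events, never stored in place."""
    return Counter(e.url for e in log if e.client == client)
```

Because the events are never updated in place, a bug in `views_per_url` can be fixed and the counts recomputed from the same log, instead of corrupting counters collected over years.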
  11. Big Data tooling
       • Apache Hadoop Distributed File System (HDFS)
           • Distributed, scalable, portable filesystem written in Java
           • Open-source, 10-year-old (!) project
           • Handles files in the gigabytes-terabytes range
           • Manages automatic replication and rebalancing of data
           • Facebook had 21 PB of storage on HDFS in 2010
           • Yahoo had a cluster of 10,000 Hadoop nodes in 2008
       • Apache Spark
           • Next-generation data processing engine written in Scala
           • Open-source, 5-year-old project
           • Up to 100 times faster than Hadoop MapReduce
           • Uses functional programming techniques to process data
           • Can scale down to run inside an IDE!
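
The functional style mentioned above can be sketched in plain Python, without any cluster. This is not Spark's API; it only mimics the shape of a map step applied to each partition followed by an associative merge, with hypothetical function names.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List, Tuple

Event = Tuple[str, str]  # (client, url)


def count_partition(partition: Iterable[Event]) -> Counter:
    """Map step: count views per URL inside one partition, independently of the others."""
    return Counter(url for _client, url in partition)


def merge(left: Counter, right: Counter) -> Counter:
    """Reduce step: combining partial counts is associative and commutative."""
    return left + right


def total_views_per_url(partitions: List[List[Event]]) -> Counter:
    """Combine the per-partition counts into one result."""
    return reduce(merge, map(count_partition, partitions), Counter())
```

Since `count_partition` has no side effects and `merge` does not care about order, an engine is free to run the map step on many machines and combine the partial results in any grouping. The same code also runs on a laptop with a single partition.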
  12. Apache Spark at a glance
  13. The good news
       • The right tools are available and open source
       • The knowledge is available and mostly free
       • It’s all ready to be learned!