Making Big Data, small
Making Big Data, small Presentation Transcript

  • 1. MAKING BIG DATA, SMALL Using distributed systems for processing, analysing and managing huge data sets. Marcin Jedyk, Software Professional’s Network, Cheshire Datasystems Ltd
  • 2. WARM-UP QUESTIONS How many of you have heard about Big Data before? How many about NoSQL? Hadoop?
  • 3. AGENDA Intro (motivation, goal and ‘not about…’); What is Big Data?; NoSQL and systems classification; Hadoop & HDFS; MapReduce & live demo; HBase
  • 4. AGENDA Pig; Building a Hadoop cluster; Conclusions; Q&A
  • 5. MOTIVATION Data is everywhere, so why not analyse it? With Hadoop and NoSQL systems, building distributed systems is easier than before. Relying on software & cheap hardware rather than expensive hardware works better!
  • 7. GOAL To explain the basic ideas behind Big Data; to present different approaches towards Big Data; to show that Big Data systems are easy to build; to show you where to start with such systems.
  • 8. WHAT IS IT NOT ABOUT? Not a detailed lecture on a single system. Not about advanced techniques in Big Data. Not only about technology, but also about its application.
  • 9. WHAT IS BIG DATA? Data characterised by 3 Vs: Volume, Variety, Velocity. The interesting ones: variety & velocity.
  • 10. WHAT IS BIG DATA? Data of high velocity: cannot store it? Process it on the fly! Data of high variety: doesn’t fit into a relational schema? Don’t use a schema, use NoSQL! Data which is impractical to process on a single server.
  • 11. NO-SQL Goes hand in hand with Big Data. NoSQL: an umbrella term for non-relational databases or data stores. It’s not always possible to replace an RDBMS with NoSQL! (The opposite is also true.)
  • 12. NO-SQL NoSQL DBs are built around different principles. Key-value stores: Redis, Riak. Document stores, e.g. MongoDB: each record is a document and each entry has its own meta-data (JSON-like, BSON). Table stores, e.g. HBase: data persisted in multiple columns (even millions), billions of rows and multiple versions of records.
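To make the three families on slide 12 a little more concrete, here is a rough Python sketch of how similar data might be shaped in each. Plain dicts and lists stand in for the stores; no Redis, MongoDB or HBase client API is used, and every name in it (`user:42`, the `info:` column prefix, the integer timestamps, the `latest` helper) is made up for illustration.

```python
# Key-value store: an opaque value looked up by a single key.
kv_store = {"user:42": "Alice"}

# Document store: each record is a self-describing, JSON-like document,
# so two documents need not share the same fields.
doc_store = [
    {"_id": 42, "name": "Alice", "langs": ["en", "pl"]},
    {"_id": 43, "name": "Bob", "email": "bob@example.org"},
]

# Table store: row key -> column -> {timestamp: value}.
# Columns are sparse per row and a cell can keep several versions.
table_store = {
    "row-42": {
        "info:name": {1: "Alice", 2: "Alice B."},
        "info:city": {1: "Chester"},
    }
}

def latest(table, row, column):
    """Return the newest version of a cell (the one with the highest timestamp)."""
    versions = table[row][column]
    return versions[max(versions)]

print(latest(table_store, "row-42", "info:name"))
```

The point of the sketch is only the shape of the data: a key-value store knows nothing about the value, a document carries its own structure, and a table store addresses a cell by row, column and version.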
  • 13. HADOOP Existed before the ‘Big Data’ buzzword emerged. A simple idea: MapReduce. Primary purpose: to crunch tera- and petabytes of data. HDFS as the underlying distributed file system.
  • 14. HADOOP – ARCHITECTURE BY EXAMPLE Imagine you need to process 1TB of logs. What would you need? A server!
  • 15. HADOOP – ARCHITECTURE BY EXAMPLE But 1TB is quite a lot of data… we want it quicker! OK, what about a distributed environment?
  • 16. HADOOP – ARCHITECTURE BY EXAMPLE So what about that Hadoop stuff? Each node can store data & process it (DataNode & TaskTracker).
  • 17. HADOOP – ARCHITECTURE BY EXAMPLE How about allocating jobs to slaves? We need a JobTracker!
  • 18. HADOOP – ARCHITECTURE BY EXAMPLE How about HDFS: how are data blocks assembled into files? The NameNode does it.
  • 19. HADOOP – ARCHITECTURE BY EXAMPLE NameNode: manages HDFS metadata, doesn’t deal with files directly. JobTracker: schedules, allocates and monitors job execution on the slaves (TaskTrackers). TaskTracker: runs MapReduce operations. DataNode: stores blocks of HDFS; default replication level for each block: 3.
  • 20. HADOOP - LIMITATIONS DataNodes & TaskTrackers are fault tolerant. NameNode & JobTracker are NOT! (Workarounds exist for this problem.) HDFS deals nicely with large files, but doesn’t do well with billions of small files.
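The storage side of slides 19 and 20 can be put into numbers with a short back-of-the-envelope sketch. It assumes the 64MB block size that the MapReduce example on slide 22 uses and the default replication level of 3 from slide 19; `hdfs_footprint` is a hypothetical helper written for this note, not part of any Hadoop API.

```python
def hdfs_footprint(file_bytes, block_bytes=64 * 2**20, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster).

    block_bytes and replication are assumptions taken from the slides:
    64MB blocks and three replicas per block.
    """
    blocks = -(-file_bytes // block_bytes)   # integer ceiling division
    raw_bytes = file_bytes * replication     # every block is stored `replication` times
    return blocks, raw_bytes

blocks, raw = hdfs_footprint(2**40)          # 1TB of logs
print(blocks, raw // 2**40)                  # → 16384 3

# A tiny file still occupies one block entry in the NameNode's metadata.
print(hdfs_footprint(1024)[0])               # → 1
```

So the 1TB of logs from the running example becomes 16384 blocks and roughly 3TB of raw disk across the DataNodes. The last line hints at why huge numbers of small files hurt: each one still costs the NameNode at least one block entry, however little data it holds.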
  • 21. MAP_REDUCE MapReduce: a parallelisation approach. Two main stages: Map, which does the actual bit of work, e.g. extracts info; Reduce, which summarises, aggregates or filters the outputs of the Map operations. Each job consists of multiple Map and Reduce operations, and each may run on a different node = parallelism.
  • 22. MAP_REDUCE – AN EXAMPLE Let’s process 1TB of raw logs and extract traffic by host. After a job is submitted, the JobTracker allocates tasks to slaves, possibly divided into 64MB packs = 16384 Map operations! Map: analyse the logs and return them as a set of <key,value> pairs. Reduce: merge the output of the Map operations.
  • 23. MAP_REDUCE – AN EXAMPLE Take a look at a mocked log extract [IP – bandwidth]: 10.0.0.1 – 1234; 10.0.0.1 – 900; 10.0.0.2 – 1230; 10.0.0.3 – 999
  • 24. MAP_REDUCE – AN EXAMPLE It’s important to define the key, in this case the IP: <;2134> <;1230> <;999>. Now, assume another Map operation returned: <;1500> <;1000> <;500>
  • 25. MAP_REDUCE – AN EXAMPLE Now, Reduce will merge those results: <;3624> <;2230> <;1499> <;500>
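The data flow of slides 22 to 25 can be mimicked in a few lines of plain Python. This is only a sketch of the idea, not Hadoop code: real jobs are normally written against Hadoop's own MapReduce API, and the function names here are invented. `split_a` follows the mocked log extract from slide 23; `split_b` and its IPs are made up, so the totals are not meant to reproduce the numbers on the slides. Summing inside the Map step is a simplification as well.

```python
from collections import defaultdict

def map_traffic(log_lines):
    """Map: parse 'IP - bytes' lines of one split and emit (ip, total) pairs."""
    totals = defaultdict(int)
    for line in log_lines:
        ip, amount = line.split(" - ")
        totals[ip.strip()] += int(amount)
    return list(totals.items())

def reduce_traffic(map_outputs):
    """Reduce: merge the pairs from all Map operations, summing per key (IP)."""
    merged = defaultdict(int)
    for pairs in map_outputs:
        for ip, amount in pairs:
            merged[ip] += amount
    return dict(merged)

# Each split would be handled by a different Map operation, possibly on a different node.
split_a = ["10.0.0.1 - 1234", "10.0.0.1 - 900", "10.0.0.2 - 1230", "10.0.0.3 - 999"]
split_b = ["10.0.0.1 - 1500", "10.0.0.2 - 1000", "10.0.0.4 - 500"]  # invented data

result = reduce_traffic([map_traffic(split_a), map_traffic(split_b)])
print(result)
```

Because the IP is the key, values for the same host end up in the same sum no matter which Map operation produced them; a host seen in only one split simply passes through unchanged.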
  • 26. MAP_REDUCE Selecting a key is important. It’s possible to define a composite key, e.g. IP+date. For more complex tasks, it’s possible to chain MapReduce jobs.
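Both points from slide 26 can be illustrated with another small Python sketch, again plain Python rather than Hadoop code. The first job uses a composite (IP, date) key; a second job is chained onto its output and re-keys it by date only. The records, dates and function names are all hypothetical.

```python
from collections import defaultdict

def map_by_ip_and_date(records):
    """Job 1 Map: emit ((ip, date), bytes) so traffic is grouped per host per day."""
    for ip, date, amount in records:
        yield (ip, date), amount

def map_to_date(per_host_day):
    """Job 2 Map: take job 1's output and re-key it by date only."""
    for (ip, date), amount in per_host_day.items():
        yield date, amount

def reduce_sum(pairs):
    """Reduce (shared by both jobs): sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

records = [  # invented sample data: (ip, date, bytes)
    ("10.0.0.1", "2012-05-01", 1234),
    ("10.0.0.1", "2012-05-01", 900),
    ("10.0.0.1", "2012-05-02", 1500),
    ("10.0.0.2", "2012-05-01", 1230),
]

per_host_day = reduce_sum(map_by_ip_and_date(records))   # job 1
per_day = reduce_sum(map_to_date(per_host_day))          # job 2, chained on job 1
print(per_host_day)
print(per_day)
```

Choosing the key decides what gets aggregated together: with the composite key the same host on two different days stays separate, and the chained job then collapses hosts to get a total per day.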
  • 27. HBASE Another layer on top of Hadoop/HDFS. A distributed data store. Not a replacement for an RDBMS! Can be used with MapReduce. Good for unstructured data: no need to worry about the exact schema in advance.
  • 28. PIG – HBASE ENHANCEMENT HBase lacks a proper query language. Pig makes life easier for HBase users. It translates queries into MapReduce jobs. When working with Pig or HBase, forget what you know about SQL; it makes your life easier.
  • 29. BUILDING HADOOP CLUSTER Post-production servers are OK. Don’t take ‘cheap hardware’ too literally. A good connection between nodes is a must: >=1Gbps between nodes, >=10Gbps between racks. 1 disk per CPU core. More RAM, more caching!
  • 30. FINAL CONCLUSIONS Hadoop and NoSQL-like DB/DS scale very well. Hadoop is ideal for crunching huge data sets. It does very well in a production environment. A cluster of slaves is fault tolerant; the NameNode and JobTracker are not!
  • 31. EXTERNAL RESOURCES Trending Topic – built on Wikipedia access logs: Building web crawler with Hadoop: Analysing adverse drug events: Moving average for large data sets:
  • 33. QUESTIONS?