MAKING BIG DATA, SMALL Using distributed systems for processing, analysing and managing huge data sets Marcin Jedyk Software Professional’s Network, Cheshire Datasystems Ltd
WARM-UP QUESTIONS How many of you have heard about Big Data before? How many about NoSQL? Hadoop?
AGENDA Intro – motivation, goal and ‘not about…’ What is Big Data? NoSQL and systems classification Hadoop & HDFS MapReduce & live demo HBase
AGENDA Pig Building Hadoop cluster Conclusions Q&A
MOTIVATION Data is everywhere – why not analyse it? With Hadoop and NoSQL systems, building distributed systems is easier than before Relying on software & cheap hardware rather than expensive hardware works better!
GOAL To explain basic ideas behind Big Data To present different approaches towards BD To show that Big Data systems are easy to build To show you where to start with such systems
WHAT IS IT NOT ABOUT? Not a detailed lecture on a single system Not about advanced techniques in Big Data Not only about technology – but also about its application
WHAT IS BIG DATA? Data characterised by 3 Vs: Volume Variety Velocity The interesting ones: variety & velocity
WHAT IS BIG DATA Data of high velocity: cannot store? Process on the fly! Data of high variety: doesn’t fit into relational schema? Don’t use schema, use NoSQL! Data which is impractical to process on a single server
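As a minimal sketch of the "process on the fly" idea, assume events arrive one at a time as (key, value) pairs. The function names and the sample events below are hypothetical. Only a running total per key is kept in memory; the events themselves are never stored.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def running_totals(events: Iterable[Tuple[str, int]]) -> Iterator[Dict[str, int]]:
    """Yield a snapshot of the per-key totals after each event."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in events:
        totals[key] += value
        yield dict(totals)  # copy, so callers cannot mutate the internal state


def sample_stream() -> Iterator[Tuple[str, int]]:
    # Hypothetical events; in practice they would come from a socket or queue.
    yield ("sensor-a", 5)
    yield ("sensor-b", 2)
    yield ("sensor-a", 3)


if __name__ == "__main__":
    for snapshot in running_totals(sample_stream()):
        print(snapshot)
```

Memory use grows with the number of distinct keys, not with the number of events, which is what makes this approach workable when the raw stream is too fast to store.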
NO-SQL Hand in hand with Big Data NoSQL – an umbrella term for non-relational databases or data stores It’s not always possible to replace RDBMS with NoSQL! (opposite is also true)
NO-SQL NoSQL DBs are built around different principles Key-value stores: Redis, Riak Document stores: e.g. MongoDB – record as a document; each entry has its own metadata (JSON-like, BSON) Table stores: e.g. HBase – data persisted in multiple columns (even millions), billions of rows and multiple versions of records
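To make the first two models concrete, here is a rough sketch using plain Python structures. It does not use the Redis or MongoDB client APIs; the keys, field names and values are made up for illustration.

```python
import json

# Key-value store: an opaque value looked up by a single key.
key_value_store = {"user:42": "Alice"}

# Document store: each record is a self-describing, JSON-like document,
# so two records in the same collection may carry different fields.
documents = [
    {"_id": 1, "name": "Alice", "email": "alice@example.com"},
    {"_id": 2, "name": "Bob", "tags": ["admin", "ops"]},
]


def fields_used(docs):
    """Return the sorted union of field names across all documents."""
    names = set()
    for doc in docs:
        names.update(doc.keys())
    return sorted(names)


if __name__ == "__main__":
    print(key_value_store["user:42"])
    print(json.dumps(documents[1]))
    print(fields_used(documents))
```

The point of `fields_used` is that the set of fields is discovered from the data rather than declared up front, which is the "each entry has its own metadata" property.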
HADOOP Existed before ‘Big Data’ buzzword emerged A simple idea – MapReduce A primary purpose – to crunch tera- and petabytes of data HDFS as underlying distributed file system
HADOOP – ARCHITECTURE BY EXAMPLE Imagine you need to process 1TB of logs What would you need? A server!
HADOOP – ARCHITECTURE BY EXAMPLE But 1TB is quite a lot of data… we want it quicker! OK, what about a distributed environment?
HADOOP – ARCHITECTURE BY EXAMPLE So what about that Hadoop stuff? Each node can: store data & process it (DataNode & TaskTracker)
HADOOP – ARCHITECTURE BY EXAMPLE How about allocating jobs to slaves? We need a JobTracker!
HADOOP – ARCHITECTURE BY EXAMPLE How about HDFS: how are data blocks assembled into files? The NameNode does it.
HADOOP – ARCHITECTURE BY EXAMPLE NameNode – manages HDFS metadata, doesn’t deal with files directly JobTracker – schedules, allocates and monitors job execution on slaves – TaskTrackers TaskTracker – runs MapReduce operations DataNode – stores blocks of HDFS – default replication level for each block: 3
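The split between metadata and data can be sketched with a toy model. This is not Hadoop code: the class name, the block id format and the round-robin placement are assumptions made for illustration (real HDFS placement is rack-aware). The toy NameNode only records which blocks make up a file and which DataNodes hold each replica.

```python
from itertools import cycle
from typing import Dict, List

REPLICATION = 3  # default replication level quoted above


class ToyNameNode:
    """Keeps only metadata: which blocks form a file and where replicas live."""

    def __init__(self, datanodes: List[str]) -> None:
        if len(datanodes) < REPLICATION:
            raise ValueError("need at least as many DataNodes as replicas")
        self._next_node = cycle(datanodes)
        self.file_blocks: Dict[str, List[str]] = {}      # path -> ordered block ids
        self.block_locations: Dict[str, List[str]] = {}  # block id -> DataNodes

    def add_file(self, path: str, num_blocks: int) -> None:
        block_ids = [f"{path}#blk{i}" for i in range(num_blocks)]
        self.file_blocks[path] = block_ids
        for block_id in block_ids:
            # Naive round-robin placement, a stand-in for the real policy.
            self.block_locations[block_id] = [
                next(self._next_node) for _ in range(REPLICATION)
            ]

    def locate(self, path: str) -> List[List[str]]:
        """Return the replica locations of each block, in file order."""
        return [self.block_locations[b] for b in self.file_blocks[path]]


if __name__ == "__main__":
    namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
    namenode.add_file("/logs/a.log", 2)
    print(namenode.locate("/logs/a.log"))
```

A client that wants to read the file asks for the block list and then fetches the bytes from the DataNodes directly, so no file content passes through the NameNode.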
HADOOP - LIMITATIONS DataNodes & TaskTrackers are fault tolerant NameNode & JobTracker are NOT! (workarounds for this problem exist) HDFS deals nicely with large files, doesn’t do well with billions of small files
MAP_REDUCE MapReduce – parallelisation approach Two main stages: Map – do an actual bit of work, e.g. extract info Reduce – summarise, aggregate or filter outputs from Map operation For each job, multiple Map and Reduce operations – each may run on different node = parallelism
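The two stages can be sketched in a few lines of plain Python. This is a single-process illustration, not the Hadoop API: the thread pool stands in for separate nodes, and `run_mapreduce`, `count_words` and `total` are hypothetical names with a toy word-count workload.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def run_mapreduce(
    splits: List[List[str]],
    mapper: Callable[[List[str]], Iterable[Pair]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    # Map stage: each split is handled independently, so tasks can run in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(lambda split: list(mapper(split)), splits))

    # Group values by key so each reducer call sees every value for one key.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)

    # Reduce stage: one call per key; these calls are independent as well.
    return {key: reducer(key, values) for key, values in grouped.items()}


def count_words(split: List[str]) -> List[Pair]:
    """Toy mapper: emit (word, 1) for every word in the split."""
    return [(word, 1) for line in split for word in line.split()]


def total(key: str, values: List[int]) -> int:
    """Toy reducer: sum all values emitted for one key."""
    return sum(values)


if __name__ == "__main__":
    print(run_mapreduce([["a b", "a"], ["b c"]], count_words, total))
```

Because neither the mapper nor the reducer shares state between calls, the same functions could in principle be scheduled on different machines, which is the property the framework exploits.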
MAP_REDUCE – AN EXAMPLE Let’s process 1TB of raw logs and extract traffic by host. After submitting a job, JobTracker allocates tasks to slaves – possibly divided into 64MB packs = 16384 Map operations! Map – analyse logs and return them as a set of <key,value> pairs Reduce – merge the output of the Map operations
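The figure of 16384 Map operations follows from dividing the input size by the split size, assuming binary units (1TB = 1024^4 bytes, 64MB = 64 × 1024^2 bytes). A small check, with a hypothetical helper name:

```python
TIB = 1024 ** 4  # binary terabyte, in bytes
MIB = 1024 ** 2  # binary megabyte, in bytes


def count_splits(total_bytes: int, split_bytes: int) -> int:
    """Number of fixed-size splits needed to cover the input.

    Uses integer ceiling division, so a partial last split still counts.
    """
    if split_bytes <= 0:
        raise ValueError("split size must be positive")
    return -(-total_bytes // split_bytes)


if __name__ == "__main__":
    print(count_splits(TIB, 64 * MIB))            # binary units: 16384
    print(count_splits(10 ** 12, 64 * 10 ** 6))   # decimal units: 15625
```

With decimal units the count comes out at 15625 instead, so the exact number depends on which convention the sizes are quoted in.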
MAP_REDUCE – AN EXAMPLE Take a look at a mocked log extract: [IP – bandwidth]
10.0.0.1 – 1234
10.0.0.1 – 900
10.0.0.2 – 1230
10.0.0.3 – 999
MAP_REDUCE – AN EXAMPLE It’s important to define the key, in this case the IP:
<10.0.0.1;2134>
<10.0.0.2;1230>
<10.0.0.3;999>
Now, assume another Map operation returned:
<10.0.0.1;1500>
<10.0.0.3;1000>
<10.0.0.4;500>
MAP_REDUCE – AN EXAMPLE Now, Reduce will merge those results:
<10.0.0.1;3634>
<10.0.0.2;1230>
<10.0.0.3;1999>
<10.0.0.4;500>
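The worked example can be reproduced with a short sketch. The function names are hypothetical and this is plain Python, not a Hadoop job. The Map step sums bandwidth per IP within one chunk of log lines, and the Reduce step adds up the per-chunk results key by key.

```python
from collections import Counter
from typing import Dict, Iterable, List

# The mocked log extract, one "IP – bandwidth" entry per line.
LOG_EXTRACT = [
    "10.0.0.1 – 1234",
    "10.0.0.1 – 900",
    "10.0.0.2 – 1230",
    "10.0.0.3 – 999",
]

# Output of the second Map operation, as given in the example.
OTHER_MAP_OUTPUT = {"10.0.0.1": 1500, "10.0.0.3": 1000, "10.0.0.4": 500}


def map_traffic(lines: Iterable[str]) -> Dict[str, int]:
    """Map: emit one <IP, bandwidth> pair per IP seen in this chunk."""
    totals: Counter = Counter()
    for line in lines:
        parts = line.split()  # first token is the IP, last is the bandwidth
        totals[parts[0]] += int(parts[-1])
    return dict(totals)


def reduce_traffic(map_outputs: List[Dict[str, int]]) -> Dict[str, int]:
    """Reduce: merge Map outputs by summing the values of equal keys."""
    merged: Counter = Counter()
    for output in map_outputs:
        merged.update(output)  # Counter.update adds to existing counts
    return dict(merged)


if __name__ == "__main__":
    first = map_traffic(LOG_EXTRACT)
    print(first)
    print(reduce_traffic([first, OTHER_MAP_OUTPUT]))
```

Running it gives 3634 for 10.0.0.1 (2134 + 1500), 1230 for 10.0.0.2, 1999 for 10.0.0.3 (999 + 1000) and 500 for 10.0.0.4.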
MAP_REDUCE Selecting a key is important It’s possible to define a composite key, e.g. IP+date For more complex tasks, it’s possible to chain MapReduce jobs
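Both ideas can be shown together in a small sketch: a first job keyed on the composite (IP, date), whose output feeds a second job keyed on IP alone. The records, dates and function names are invented for illustration, and the two functions only mimic what two chained jobs would compute.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical records: (IP, date, bytes transferred).
RECORDS = [
    ("10.0.0.1", "2012-05-01", 1234),
    ("10.0.0.1", "2012-05-01", 900),
    ("10.0.0.1", "2012-05-02", 1500),
    ("10.0.0.2", "2012-05-01", 1230),
]


def traffic_by_host_and_day(
    records: Iterable[Tuple[str, str, int]]
) -> Dict[Tuple[str, str], int]:
    """Job 1: total traffic per composite key (IP, date)."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for ip, day, nbytes in records:
        totals[(ip, day)] += nbytes
    return dict(totals)


def busiest_day_per_host(
    daily_totals: Dict[Tuple[str, str], int]
) -> Dict[str, Tuple[str, int]]:
    """Job 2: consume job 1's output and keep the heaviest day for each IP."""
    best: Dict[str, Tuple[str, int]] = {}
    for (ip, day), total in daily_totals.items():
        if ip not in best or total > best[ip][1]:
            best[ip] = (day, total)
    return best


if __name__ == "__main__":
    daily = traffic_by_host_and_day(RECORDS)
    print(daily)
    print(busiest_day_per_host(daily))
```

The choice of key decides what ends up in the same Reduce call: keyed on IP alone, the per-day breakdown would be lost before the second job could use it.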
HBASE Another layer on top of Hadoop/HDFS A distributed data storage Not a replacement for RDBMS! Can be used with MapReduce Good for unstructured data – no need to worry about exact schema in advance
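The "no exact schema in advance" point can be illustrated with a toy in-memory table. This is not the HBase client API: `ToyTable` and its methods are hypothetical, and the `family:qualifier` column names merely imitate HBase's naming convention. Every cell keeps its past versions, and new columns appear simply by being written.

```python
import itertools
from collections import defaultdict
from typing import Any, Dict, List, Tuple


class ToyTable:
    """Rows hold arbitrary columns; every cell keeps its past versions."""

    def __init__(self) -> None:
        self._clock = itertools.count(1)  # logical timestamps
        # (row key, column) -> list of (timestamp, value), oldest first
        self._cells: Dict[Tuple[str, str], List[Tuple[int, Any]]] = defaultdict(list)

    def put(self, row: str, column: str, value: Any) -> int:
        """Write a new version of a cell and return its timestamp."""
        timestamp = next(self._clock)
        self._cells[(row, column)].append((timestamp, value))
        return timestamp

    def get(self, row: str, column: str, versions: int = 1) -> List[Any]:
        """Return up to `versions` values of a cell, newest first."""
        history = self._cells.get((row, column), [])
        return [value for _, value in reversed(history)][:versions]

    def columns(self, row: str) -> List[str]:
        """List the columns that exist for one row."""
        return sorted(c for (r, c) in self._cells if r == row)


if __name__ == "__main__":
    table = ToyTable()
    table.put("row1", "info:name", "Alice")
    table.put("row1", "info:name", "Alicia")  # a newer version of the same cell
    table.put("row1", "stats:visits", 3)      # a column nobody declared upfront
    print(table.get("row1", "info:name", versions=2))
    print(table.columns("row1"))
```

Different rows can therefore carry entirely different sets of columns, which is what makes this model convenient for loosely structured data.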
PIG – HBASE ENHANCEMENT HBase lacks a proper query language Pig – makes life easier for HBase users Translates queries into MapReduce jobs When working with Pig or HBase, forget what you know about SQL – it makes your life easier
BUILDING HADOOP CLUSTER Post-production servers are OK Don’t take ‘cheap hardware’ too literally Good connection between nodes is a must! >=1Gbps between nodes >=10Gbps between racks 1 disk per CPU core More RAM, more caching!
FINAL CONCLUSIONS Hadoop and NoSQL-like DB/DS scale very well Hadoop is ideal for crunching huge data sets Does very well in a production environment Cluster of slaves is fault tolerant, NameNode and JobTracker are not!
EXTERNAL RESOURCES Trending Topic – built on Wikipedia access logs: http://goo.gl/BWWO1 Building web crawler with Hadoop: http://goo.gl/xPTlJ Analysing adverse drug events: http://goo.gl/HFXAx Moving average for large data sets: http://goo.gl/O4oml