A gentle introduction to the world of BigData and Hadoop


A gentle introduction to the world of #BigData and #Hadoop, with a quick look at what you can do in Azure

Published in: Technology

  • Examples of data to store: transactions (financial, government related); logs (records of activity, location); business data (product catalogs, prices, customers); user data (images, documents, video); sensor data (temperature, pollution); medical data (x-rays, brain activity records); social (email, Twitter, etc.)
  • Lambda architecture: “community” driven architecture providing a way for different BigData components to work together
  • The batch layer runs in a while(true) loop and recomputes the batch view from scratch. It's quite simple to implement.
  • The speed layer maintains the same keys as the batch layer, so it can recognize and select the same data. The difference is that this layer updates its view while it is receiving the data.
  • RECAP… Usually the Batch Layer is implemented with HDFS (Hadoop Distributed File System). Serving database: ElephantDB, HBase… Speed Layer: Cassandra (a map with a sorted map as its value), or Cassandra with Storm (stream access), or an in-memory DB.
  • RECAP…
  • Example: in the cloud, on an elastic first-level system, the service should be “stateless” or at least “soft-state” (cached) and must always respond to the query, even if the backend is down. So the system will be “A” (immediately responsive) and “P” (it keeps responding to requests regardless of a failure in the backend).
  • Using SQL Server 2008 R2, Yahoo! enhanced its Targeting, Analytics and Optimization (TAO) infrastructure. Key points: with Big Data technology, Yahoo! experienced the following benefits: improved ad campaign effectiveness and increased advertiser spending; a cube producing 24 terabytes of data quarterly, making it the world’s largest SQL Server Analysis Services cube; the ability to handle more than 3.5 billion daily ad impressions, with hourly refresh rates. Reference: Microsoft case study, “Yahoo! Improves Campaign Effectiveness, Boosts Ad Revenue with Big Data Solution”: http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=710000001707

    1. 1. Hello, A “gentle” introduction to the world of Big Data and the Hadoop platform
    2. 2. Agenda 1. Introduction • The history, the #BigData, a bit of the theory behind it… 2. What is Hadoop, part 1 • Introducing HDFS and Map/Reduce 3. What is Hadoop, part 2 • The next generation (v. 2.x), real time, … 4. Microsoft and Big Data • Lambda architecture and Windows Azure, WA Storage(s), WA HDInsight 5. Q&A
    3. 3. Who am I? (Who cares?) Stefano Paluello • Tech Lead @ SG Gaming • All-around geek, passionate about architecture, Cloud and Data • Co-founder of various start-up(s)
    4. 4. How it all started…
    5. 5. Oops…
    6. 6. history • 2002: Hadoop, created by Doug Cutting, has its origins in Apache Nutch, an open source search engine for the Web that was part of the Lucene project (full-text search engine). • 2003: Google publishes a paper describing its own distributed file system, the Google File System (GFS). • 2004: The first version of NDFS, the Nutch Distributed File System, implements the design described in Google’s paper.
    7. 7. history • 2004: Google publishes another paper, introducing the MapReduce algorithm • 2005: The first version of MapReduce is implemented in Nutch • 2005 (end): Nutch’s MapReduce is running on NDFS • 2006 (Feb): Nutch’s MapReduce and NDFS become the core of a new Lucene subproject: Hadoop
    8. 8. history • 2008: Yahoo launches the world’s largest Hadoop PRODUCTION site. Some Webmap size data: • # of links between pages in the index: roughly 1 trillion (10^12) links • Size of the output: over 300 TB, compressed (!!!) • # of cores to run a single MapReduce job: over 10000 • Raw disk used in the production cluster: over 5 Petabytes
    9. 9. OK, let’s start with… … a bit of theory
    10. 10. Nooo, Wait! Don’t run away
    11. 11. What is #BigData? BigData is a definition, but for some it is a buzzword (a keyword with no precise meaning that nevertheless sounds interesting) that tries to address this “new” (really?!?) need to process a lot of data. To identify BigData we usually use the “Three V’s”.
    12. 12. The 3 V’s of #BigData? Volume: the size of the data that we’re dealing with Variety: the data is coming from a lot of different sources Velocity: the speed at which the data is generated
    13. 13. Source: www.wipro.com, July 2012
    14. 14. And the 4Vs of #BigData? Sources: www.wipro.com (July 2012), Oracle.com
    15. 15. the 4Vs of #BigData (2) Source: IBM.com
    16. 16. #BigData It is predicted that between 2009 and 2020 the estimated size of the “digital universe” will grow by around 35 Zettabytes per year (!!!), where 1 Zettabyte is about 2^70 bytes. 1 Zettabyte = 1k Exabytes or 1M Petabytes or 1G Terabytes. Source: www.wipro.com, July 2012. The #BigData market analysis, together with the 3Vs definition, was introduced by Gartner research about 13 years ago: http://blogs.gartner.com/doug-laney/deja-vvvue-others-claiminggartners-volume-velocity-variety-construct-for-big-data/
    17. 17. Big Data Lambda Architecture What??? Lam…what???
    18. 18. I said LAMBDA !!!
    19. 19. Lambda Architecture Solves the problem of computing arbitrary functions on arbitrary data by decomposing the problem into three layers: • The batch layer • The serving layer • The speed layer
    20. 20. The Batch layer • Stores all the data in an immutable, constantly growing dataset • Accessing all the data is too expensive (even if possible) • Precomputed “query” functions (aka “batch views”, high-latency operations) are created, allowing the results to be accessed quickly
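As a rough illustration of the batch layer, the sketch below recomputes a batch view from scratch over an append-only master dataset. The event fields (`url`, `user`) and the function name `recompute_batch_view` are hypothetical, chosen only for this example; a real batch layer would run this kind of job on HDFS with MapReduce.

```python
from collections import Counter

# Immutable, append-only master dataset (hypothetical page-view events).
master_dataset = [
    {"url": "/home", "user": "anna"},
    {"url": "/home", "user": "bob"},
    {"url": "/about", "user": "anna"},
]


def recompute_batch_view(events):
    """Recompute the whole batch view (page views per URL) from scratch."""
    return dict(Counter(event["url"] for event in events))


# New data is only ever appended; existing records are never modified.
master_dataset.append({"url": "/home", "user": "carl"})

# In the batch layer this call would sit inside the while(true) loop.
batch_view = recompute_batch_view(master_dataset)
print(batch_view)
```

Recomputing everything is slow but simple: a bug in the view logic is fixed by correcting the function and running it again over the untouched raw data.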
    21. 21. The Batch layer Source: “Big Data”, by Manning
    22. 22. The Serving layer • Indexes the batch views • Loads the batch views and allows them to be accessed and queried efficiently • Usually it is a distributed database that loads in the batch views and is updated by the batch layer • It requires batch updates and random reads but does NOT require random writes
    23. 23. The Speed layer • Compensates for the high-latency updates of the serving layer • Provides fast incremental algorithms • Updates the realtime view while receiving new data, without recomputing it from scratch like the batch layer does
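To make the idea of incremental updates concrete, here is a minimal sketch of a speed layer that keeps a realtime view keyed the same way as the batch view, and a query that merges the two. The names `on_new_event` and `query`, and the page-view data, are assumptions made for this example only.

```python
from collections import Counter

# Batch view: precomputed by the (slow) batch layer up to some point in time.
batch_view = {"/home": 3, "/about": 1}

# Realtime view: covers only the events that arrived after the last batch run.
realtime_view = Counter()


def on_new_event(event):
    """Incrementally update the realtime view; nothing is recomputed from scratch."""
    realtime_view[event["url"]] += 1


def query(url):
    """Answer a query by merging the batch view with the realtime view."""
    return batch_view.get(url, 0) + realtime_view[url]


on_new_event({"url": "/home"})
on_new_event({"url": "/pricing"})
print(query("/home"), query("/pricing"))
```

Once the next batch run has absorbed these events, the corresponding entries in the realtime view can be discarded.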
    24. 24. The Speed layer Source: “Big Data”, by Manning
    25. 25. Recap Source: “Big Data”, by Manning
    26. 26. Distributed Data 101 Just a couple of reminders…
    27. 27. ACID ACID is a set of properties that guarantee that database transactions are processed reliably [Source: Wikipedia] • Atomicity: “all or nothing”. All the modifications in a transaction must happen successfully, or no changes are committed • Consistency: all the data will always be in a valid state after every transaction • Isolation: transactions are isolated, so each transaction is separated and won’t affect the data of other transactions • Durability: once a transaction is committed, the related data are safely and durably stored, regardless of errors, crashes or any software malfunctions
    28. 28. CAP The CAP theorem (or Brewer’s theorem) is a set of basic requirements that describe a distributed system • Consistency: all the servers in the system will have the same data • Availability: all the servers in the system will be available and will return the data they have (even if it may not be consistent across the system) • Partition (tolerance): the system continues to operate as a whole despite arbitrary message loss or the failure of a part of the system. According to the theorem, a distributed system CANNOT satisfy all three requirements at the SAME time (the “two out of three” concept).
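Atomicity and consistency can be seen in a small experiment with Python's built-in `sqlite3` module. This is only a sketch: the `accounts` table, the `CHECK` constraint and the `transfer` helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, source, target, amount):
    """Move money between two accounts inside a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


print(transfer(conn, "alice", "bob", 30))   # succeeds
print(transfer(conn, "alice", "bob", 500))  # violates the CHECK, rolled back
print(balances(conn))
```

In the failing transfer the credit to `bob` has already been applied when the debit violates the constraint; the rollback undoes it, so the data never ends up in a half-updated state.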
    29. 29. Here we are… Your “#BigData 101” degree!
    30. 30. What is Hadoop? (Part 1)
    31. 31. Hadoop… Where does it come from? The “legend” says that the name comes from the toy elephant of Doug Cutting’s son (Doug Cutting is one of the founders of the project). That is also why the logo is a smiling yellow elephant.
    32. 32. Hadoop cluster A Hadoop cluster consists mainly of two modules: • A way to store distributed data: HDFS, the Hadoop Distributed File System (storage layer) • A way to process data: MapReduce (compute layer). This is the core of Hadoop!
    33. 33. HDFS The Hadoop Distributed File System • From a developer’s point of view it looks like a standard file system • Runs on top of the OS file system (ext3, …) • Designed to store a very large amount of data (petabytes and so on) and to solve some of the problems that come with DFS and NFS • Provides fast and scalable access to the data • Stores data reliably
    34. 34. How does this…
    35. 35. HDFS under the hood All the files loaded in Hadoop are split into chunks, called blocks. Each block has a fixed size of 64 MB (!!!). Yes, Megabytes! Example: MyData (150 MB) is stored in HDFS as Blk_01 (64 MB), Blk_02 (64 MB) and Blk_03 (22 MB).
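The block arithmetic from the example can be checked with a few lines of Python. The helper name `split_into_blocks` is made up for this sketch, and the 64 MB block size is the value quoted on the slide.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left
    return sizes


# The 150 MB "MyData" file from the slide.
print(split_into_blocks(150))
```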
    36. 36. Datanode(s) and Namenode • The Datanode is a daemon (a service, in Windows terms) running on each node of the cluster, responsible for storing the blocks • The Namenode is a dedicated node where the metadata of all the files (blocks) in the system is stored. It’s the directory manager of HDFS • To access a file, a client contacts the Namenode to retrieve the list of locations of its blocks. With those locations, the client contacts the Datanodes to read the data (possibly in parallel).
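The read path can be mimicked with two dictionaries: one playing the Namenode (metadata only) and one playing the Datanodes (the actual bytes). This is a toy model, not the HDFS client API; the file path, block ids and node names are all hypothetical.

```python
# Namenode metadata: file name -> ordered list of (block id, Datanodes holding it).
namenode = {
    "/data/mydata": [("blk_01", ["dn1", "dn2"]), ("blk_02", ["dn2", "dn3"])],
}

# Datanode storage: node name -> {block id: block content}.
datanodes = {
    "dn1": {"blk_01": b"hello "},
    "dn2": {"blk_01": b"hello ", "blk_02": b"hadoop"},
    "dn3": {"blk_02": b"hadoop"},
}


def read_file(path):
    """Ask the Namenode where the blocks are, then fetch them from the Datanodes."""
    content = b""
    for block_id, locations in namenode[path]:
        node = locations[0]  # toy choice: simply take the first replica listed
        content += datanodes[node][block_id]
    return content


print(read_file("/data/mydata"))
```

Note that the Namenode never touches the file content: it only answers the question of where each block lives.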
    37. 37. Data Redundancy Hadoop replicates each block THREE times as it’s stored in HDFS. The location of every block is managed by the Namenode. If a block is under-replicated (due to a failure on a node), the Namenode is smart enough to create another replica, until each block has three replicas inside the cluster. Yes… you did your homework! If I have 100 TB of data to store in Hadoop, I will need 300 TB of storage space.
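Both the storage arithmetic and the re-replication behaviour can be sketched in a few lines. This is a simplified model under stated assumptions: the node names are invented, and the real Namenode uses rack-aware placement rather than the alphabetical choice made here.

```python
REPLICATION_FACTOR = 3


def raw_storage_needed(data_tb, replication=REPLICATION_FACTOR):
    """Raw disk space (in TB) needed to store `data_tb` terabytes of data."""
    return data_tb * replication


def re_replicate(block_locations, live_nodes, replication=REPLICATION_FACTOR):
    """Toy version of the Namenode's job: top up under-replicated blocks."""
    for nodes in block_locations.values():
        nodes &= live_nodes  # forget the replicas that lived on failed nodes
        candidates = sorted(live_nodes - nodes)
        while len(nodes) < replication and candidates:
            nodes.add(candidates.pop(0))
    return block_locations


locations = {
    "blk_01": {"dn1", "dn2", "dn3"},
    "blk_02": {"dn2", "dn3", "dn4"},
}
live = {"dn1", "dn2", "dn4", "dn5"}  # dn3 has failed

re_replicate(locations, live)
print(raw_storage_needed(100))
print({block: sorted(nodes) for block, nodes in locations.items()})
```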
    38. 38. Datanode(s) and Namenode [Diagram: one Namenode (NN) and five Datanodes (D)]
    39. 39. Namenode availability If the Namenode fails, the WHOLE cluster becomes inaccessible. In the early versions the Namenode was a single point of failure. A couple of solutions are now available: • the Namenode stores its metadata on the network through NFS • most production sites have two Namenodes: Active and Standby
    40. 40. HDFS Quick Reference The HDFS commands are pretty easy to use and to remember (especially if you come from a *nix-like environment). The commands usually have the “hadoop fs” prefix. To list the content of an HDFS folder: > hadoop fs -ls To load a file into HDFS: > hadoop fs -put <file> To read a file loaded into HDFS: > hadoop fs -tail <file> And so on… > hadoop fs -mkdir <dir> > hadoop fs -mv <sourcefile> <destfile> > hadoop fs -rm <file>
    41. 41. MapReduce
    42. 42. MapReduce Processing a large file serially could be a problem. MapReduce is designed to be a highly parallel way of processing data • Data are split into many pieces • Each piece is processed simultaneously and in isolation by tasks called Mappers • The results of the Mappers are then brought together (by a process called “Shuffle and Sort”) into a second set of tasks, the Reducers
    43. 43. Mappers
    44. 44. Reducers
    45. 45. The MapReduce “HelloWorld” All the examples and tutorials of MapReduce start with one simple example: the Wordcount. Let’s take a look at it.
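To show the shape of the algorithm without a cluster, here is Wordcount in plain Python, with the three phases (map, shuffle and sort, reduce) written out explicitly. It is a single-process sketch of the idea, not the Hadoop API; the function names are chosen for this example.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce phase: sum all the counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    # Map: every line is processed independently of the others.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle and sort: bring all the pairs with the same key together.
    pairs.sort(key=itemgetter(0))
    # Reduce: one call per distinct key.
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )


result = word_count(["the quick brown fox", "the lazy dog", "The end"])
print(result)
```

On a real cluster the map calls run on different nodes and the sort is done by the framework, but the mapper and reducer logic stays this small.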
    46. 46. Java code…
    47. 47. Using Hadoop Streaming Hadoop Streaming allows you to write Mappers and Reducers in almost any language, rather than forcing you to use Java. The command to run a streaming job is a bit “tricky”.
    48. 48. MapReduce on a “real” case A retailer with many stores around the country. The data are written to a sequential log with date, store location, item, price, payment: • 2014-01-01, London, Clothes, 13.99£, Card • 2014-01-01, NewCastle, Music, 05.69£, Bank • … A really simple mapper will split all the data and then pass them to a reducer. The reducer will calculate the sales total for every location.
    49. 49. Python code…
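A minimal sketch of what a streaming-style mapper and reducer for the retailer log could look like. It is not the code from the slide: the tab-separated input format, the third sample row and the malformed row are assumptions made for illustration. With Hadoop Streaming the two functions would live in separate scripts that read `sys.stdin` and print to standard output; here they take and return plain iterables so the example runs locally.

```python
def mapper(lines):
    """Emit 'store<TAB>price' for every well-formed log line."""
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) != 5:
            continue  # skip malformed lines
        _date, store, _item, price, _payment = fields
        yield f"{store}\t{price.rstrip('£')}"


def reducer(sorted_lines):
    """Sum the prices per store; the input must already be sorted by store."""
    current_store, total = None, 0.0
    for line in sorted_lines:
        store, price = line.split("\t")
        if store != current_store:
            if current_store is not None:
                yield current_store, round(total, 2)
            current_store, total = store, 0.0
        total += float(price)
    if current_store is not None:
        yield current_store, round(total, 2)


log = [
    "2014-01-01\tLondon\tClothes\t13.99£\tCard",
    "2014-01-01\tNewCastle\tMusic\t05.69£\tBank",
    "2014-01-02\tLondon\tMusic\t10.00£\tCard",  # invented extra row
    "broken line",                               # invented malformed row
]

# Locally, sorted() plays the role of Hadoop's "Shuffle and Sort" step.
totals = dict(reducer(sorted(mapper(log))))
print(totals)
```

The reducer relies on the framework's guarantee that all the values for one key arrive together, which is why it only needs to remember the current store.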
    50. 50. How MapReduce works…
    51. 51. … and the Streaming
    52. 52. Hadoop related projects • PIG: a high-level language for analyzing large data sets. It works as a compiler that produces M/R jobs • HIVE: data warehouse software that facilitates querying and managing large data sets with a SQL-like language • HBase: a scalable, distributed database that supports structured data storage for large tables • Cassandra: a scalable multi-master database
    53. 53. What is Hadoop? (Part 2)
    54. 54. Hadoop v 2.x Hadoop is a pretty easy system to use, but a bit tricky to set up and manage. The skills required are more related to system management than to the dev side. Let’s add that the Apache documentation has never stood out for clarity and completeness. So, to add a bit of mess, they decided to make v2, which is actually changing a lot.
    55. 55. Hadoop v 2.x The new Hadoop now has FOUR modules (instead of two) • Hadoop Common: common utilities supporting all the other modules • HDFS: an evolution of the previous distributed FS • Hadoop YARN: a framework for job scheduling and cluster resource management • Hadoop MapReduce: a YARN-based system for parallel processing of large data sets
    56. 56. Hadoop v 2.x Hadoop v2, leveraging YARN, is aiming to become the new OS for data processing
    57. 57. Hadoop and real time Hadoop v2, using YARN and Storm (a free and open source distributed real-time computation system), can process your data in real time. Some Hadoop distributions (like Hortonworks) are working on an effortless integration: http://hortonworks.com/blog/stream-processing-inhadoop-yarn-storm-and-the-hortonworks-data-platform/
    58. 58. Microsoft Azure and Hadoop
    59. 59. Microsoft Lambda Architecture support • Batch Layer: WA HDInsight; WA Blob storage; MapReduce, Hive, Pig, … • Speed Layer: Federation in WA SQL DB; Azure Tables; Memcached/MongoDB; SQL Azure; Reactive Extensions (Rx) • Serving Layer: Azure Storage Explorer; MS Excel (and Office suite); Reporting Services; Linq To Hive; Analysis Services
    60. 60. Yahoo, Hadoop and SQL Server [Diagram labels: Apache Hadoop; Hadoop Data; Staging Database; SQL Server Connector (Hadoop Hive ODBC); SQL Server Analysis Services (SSAS Cube); Third Party Database + Custom Applications; Microsoft Excel and PowerPivot; Other BI Tools and Custom Applications]
    61. 61. MS .NET SDK for Hadoop • .NET client libraries for Hadoop • Write MapReduce in Visual Studio using C# or F# • Debug against local data
    62. 62. WebClient Libraries in .NET • WebHDFS client library: works with files in HDFS and Windows Azure Blob storage (scalable REST API; move files in and out of, and delete them from, HDFS; perform file and directory functions) • WebHCat client library: manages the scheduling and execution of jobs in an HDInsight cluster (HDInsight job scheduling and execution)
    63. 63. Reactive Extensions (Rx): Pulling vs. Pushing Data Interactive vs Reactive • In interactive programming, the application pulls data from a sequence that represents the source (IEnumerator) • In reactive programming, the application subscribes to a data stream (called an observable sequence in Rx), and updates are handed to it from the source
    64. 64. Reactive Extensions (Rx): Pulling vs. Pushing Data [Diagram] Interactive: the application asks “Got next?” and calls MoveNext (IEnumerable<T>, IEnumerator<T>). Reactive: the source announces “Have next!” and calls OnNext on the application (IObservable<T>, IObserver<T>).
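The pull versus push contrast can be sketched in Python, where an iterator plays the role of IEnumerator<T> and a small hand-written class stands in for an observable sequence. The `Observable` class below is hypothetical and only mimics the idea; it is not the Rx API.

```python
class Observable:
    """Tiny hand-rolled stand-in for an observable sequence (not the Rx API)."""

    def __init__(self):
        self._observers = []

    def subscribe(self, on_next):
        """Register a callback that will receive every pushed value."""
        self._observers.append(on_next)

    def push(self, value):
        # The source decides when data flows: it calls every subscriber.
        for on_next in self._observers:
            on_next(value)


# Interactive (pull): the consumer asks for the next item when it wants one.
pulled = []
iterator = iter([1, 2, 3])
for item in iterator:  # each step implicitly calls next(iterator)
    pulled.append(item)

# Reactive (push): the consumer subscribes once, then the source hands over updates.
pushed = []
stream = Observable()
stream.subscribe(pushed.append)
for value in (1, 2, 3):
    stream.push(value)

print(pulled, pushed)
```

Both consumers end up with the same data; what changes is who is in control of the timing.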