Intro to Big Data and Hadoop - UBC CS Lecture Series - Geoff Fawkes

Speaker Notes
  • Housekeeping: keep your mobile devices on, turn the ringer volume up loud, tweet, check in on Foursquare, and update your Facebook as I speak. We now live in a multi-tasking world, so I'm OK with interruptions. Ask questions; if I don't have the answer, someone else may, and you can drop me an email after. How many pages?!
  • Introductory presentation for new hires at Teradata. A mixture of business and engineering concepts. This only scratches the surface; see the references at the end of the presentation.
  • Zettabyte = 10^21 bytes.
  • Teradata used Tableau.
  • Baidu is the Chinese-language counterpart of Google. The quote is from author William Gibson, who coined the term "cyberspace" (first used in his 1982 short story "Burning Chrome" and popularized in his 1984 novel Neuromancer) and predicted the rise and popularity of reality TV.
  • Structured data has a defined format, such as an XML document or database tables. Semi-structured data may have a schema, but it is often ignored, e.g. a spreadsheet, in which cells can store any type of data. Unstructured data has no particular internal structure, e.g. plain text, an image file, or a Twitter feed. 80% of Big Data is unstructured.
  • If Gartner says so, it must be right ;>) Motivations for Hadoop:
    - Huge dependency on the network and huge bandwidth demands
    - Scaling up and down is not a smooth process
    - Partial failures are difficult to handle
    - A lot of processing power is spent on transporting data
    - Data synchronization is required during exchange
    As a developer, you should not have to worry about handling these issues in your application; these are the problems Hadoop solves, leaving you to focus on business logic.
  • The basic I/O problem: while the storage capacity of hard drives has increased, access speed (the rate at which data can be read) has not kept pace. E.g. 1 TB drives are normal, but at a 100 MB/s transfer rate it takes roughly 2.8 hours to read all the data on the drive (the arithmetic is worked below).
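
A quick back-of-the-envelope check of that read-time figure, assuming decimal units (1 TB = 10^12 bytes, 100 MB/s = 10^8 bytes/s):

```latex
\[
t = \frac{10^{12}\ \text{bytes}}{10^{8}\ \text{bytes/s}}
  = 10^{4}\ \text{s} \approx 2.8\ \text{hours}
\]
```
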
  • The world continues to move towards commodity hardware.
  • Commercial companies focused on developing and supporting Hadoop: Hortonworks, Cloudera, and Amazon Web Services (AWS).
  • In simpler terms, Hadoop is a framework that coordinates many machines working together to analyze large sets of data. The framework provides reliability and data motion: MapReduce divides an application's work into many small fragments, each executed (or re-executed) on a node in the cluster, and data is stored on many compute nodes, giving HDFS very high aggregate bandwidth across the cluster. Node failures are handled automatically by the framework through parallelism, heartbeats, checksums, and replication.
  • The Hadoop platform consists of the Hadoop kernel (implemented in Java), MapReduce (programmable in any language), and HDFS (the Hadoop Distributed File System). HDFS can be accessed natively through a Java API (a C-language wrapper is also available); a sketch of that API follows below. Ext3, the third extended file system commonly used by the Linux kernel, is supported, as is XFS, a journaling file system supporting 64-bit sizes and parallel I/O.
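
For a feel of that native Java API, here is a minimal sketch (not code from the deck) that reads a file out of HDFS; the NameNode URI and file path are hypothetical placeholders:

```java
// Minimal sketch: read a file through the HDFS Java API.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the NameNode (hypothetical host and port).
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt"); // hypothetical path

        // fs.open() returns a stream backed by the DataNodes holding the blocks.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```
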
  • Blocks: a disk block is 512 bytes, a file-system block is a few kilobytes (typically 4 KB), and an HDFS block is 64 MB by default (often configured up to 128 MB). An HDFS block is much larger than a disk block to minimize the cost of disk seeks. HDFS files are write-once: once written and closed, they cannot be changed. A typical single file in HDFS is gigabytes to terabytes in size. (A per-file block-size sketch follows below.)
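
The block size can be chosen per file at creation time. A minimal sketch, assuming the standard FileSystem.create() overload; the path is hypothetical:

```java
// Minimal sketch: create an HDFS file with a non-default block size.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        long blockSize = 128L * 1024 * 1024; // 128 MB instead of the 64 MB default
        short replication = 3;               // the usual HDFS default
        int bufferSize = 4096;

        try (FSDataOutputStream out = fs.create(
                new Path("/data/big-file.bin"), // hypothetical path
                true,                           // overwrite if present
                bufferSize, replication, blockSize)) {
            out.writeUTF("HDFS files are write-once: immutable after close");
        }
    }
}
```
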
  • Terminology: a set of machines running Hadoop is a cluster, using a master-slave architecture. A Hadoop instance has a single NameNode and a cluster of DataNodes. The NameNode is the software that maintains the file-system structure and metadata for the DataNodes; a DataNode is the software that stores and retrieves blocks of data. There can be up to 4,000 slave DataNodes per NameNode. The JobTracker (on the NameNode machine) tracks MapReduce job execution; a TaskTracker (on each DataNode) carries out the MapReduce processing for individual tasks. The NameNode does not require much disk space but needs a lot of RAM (it is the brains of the instance); a DataNode does not require much RAM but needs a lot of disk space. Failover is the transition from the active NameNode to a secondary/standby NameNode, coordinated by a failover controller such as ZooKeeper.
  • HDFS is designed to run on commodity hardware: low-cost servers running Linux/Apache. The philosophy of the cluster design is to bring computing as close as possible to the data. All HDFS communication protocols are layered on top of TCP/IP, so the NameNode and DataNodes can be located anywhere.
  • A single instance is a single HDFS cluster.
  • Hardware failure and data corruption are the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. With such a huge number of components, each having a non-trivial probability of failure, some components of HDFS are always non-functional (dead). By default, each block is replicated 3 times (applications can change this in configuration; see the sketch below). Replica placement is heavily studied for optimization; HDFS's policy is to put one replica on a node in the local rack and distribute the other replicas to other nodes and other racks, with the goals of reducing seek times and encouraging cluster rebalancing. Separately from file operations, the NameNode periodically receives a Heartbeat and a Blockreport from each DataNode in the cluster; receipt of a Heartbeat implies the DataNode is functioning properly. If the NameNode itself crashes, a backup has to be restored from disk. The ZooKeeper tool provides NameNode failover coordination through highly available active/passive NameNodes.
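
To illustrate how the replication factor is set by configuration rather than fixed by the framework, a minimal sketch (an assumption, not code from the deck; the file path is hypothetical):

```java
// Minimal sketch: default and per-file replication factors.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3); // default for newly created files

        FileSystem fs = FileSystem.get(conf);
        // Raise a hot file to 5 replicas; the NameNode schedules the extra copies.
        fs.setReplication(new Path("/data/hot-file.bin"), (short) 5);
    }
}
```
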
  • The UNIX analogy: MapReduce is a large distributed pipeline, much like `cat input | map | sort | reduce > output` run across many machines.
  • Map Server/Function 1, 2, and 3 each process in parallel. In MapReduce, every input is viewed as a key-value pair, e.g. Key = Sentence 1, Value = "John has a red car, which has no radio". Step 1: each sentence is given to a Map, and each word is counted in a wave; in this example there are 3 Map jobs. Step 2: shuffle-and-sort moves the words between servers so that all identical keys are brought together. Step 3: the words on each server are aggregated and reduced; in this example the Reduce is performed across two waves. The final output is on the lower right. (A Java word-count sketch follows below.)
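
The same word-count pattern, sketched in Java in the classic Hadoop MapReduce style; this mirrors the widely published WordCount example rather than the exact code on the slide:

```java
// Map emits (word, 1); shuffle/sort groups identical words; Reduce sums counts.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // All counts for one word arrive together after shuffle/sort.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```
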
  • As a developer, you have to start thinking about your data storage problem in a distributed way, instead of in a monolithic way.
  • Step 1: data is broken into file splits of 64 MB (or 128 MB) and the blocks are distributed to different DataNodes. Step 2: once all the blocks are in place, the Hadoop framework ships your program to each DataNode. Step 3: the JobTracker then schedules the program on the individual DataNodes. Step 4: once all the DataNodes are done, the output (yellow in the diagram) is written back. (A driver sketch follows below.)
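
A driver that would submit the WordCount classes sketched earlier follows this same sequence. This is a sketch assuming the Hadoop 2.x Job API; the input/output paths are hypothetical:

```java
// Minimal sketch: configure and submit the word-count job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/input"));    // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // hypothetical

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
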
  • Also built on top of Hadoop are the helper applications:
    - Hive: interactive SQL query and modeling using a data-warehouse view of HDFS; projects a table structure onto the dataset and manipulates it with HiveQL
    - Pig: data flow for tedious MapReduce jobs; a language for expressing data analysis and infrastructure processes
    - HBase: columnar NoSQL store for billions of rows
    - HCatalog: table and schema management
    - ZooKeeper: NameNode-to-backup failover coordination
    - Ambari: management tool
  • Download commercial implementations: Hortonworks (its Sandbox is a single-node download), Cloudera, or Amazon Web Services.
  • The question is not "Why should I care about Big Data?" but rather "How can I get closer to Big Data and start taking advantage of it?" Thanks to Peter Smith and Michel Ng for organizing. If you have a topic you would like to present on, see Peter; contribute your expertise to the tech ecosystem in Vancouver. Send me questions via LinkedIn and a copy will be posted to my profile. Hootsuite, QuickMobile, and a few other Vancouver companies are looking for analytics developers; have a look.

Presentation Transcript

  • 1. Introduction to Analytics and Big Data - Hadoop The University of British Columbia Computer Science Alumni/Industry Lecture Series Geoff Fawkes November, 2013 © 2013 Geoff Fawkes. All Rights Reserved. 1 / 450
  • 2. Who am I?  Director Engineering, Teradata  HSBC, Pivotal/Aptean, Newbridge/Alcatel, etc., in various engineering roles  Technology executive, mentor, software engineer  B.Sc. Comp Sci (UBC), MBA Executive (SFU)  Interruptive (disruptive?) personality  Please ask questions to me / each other as we go along; I don't have all the answers – you do!  Credits: Rob Pegler, SNIA Education  Storage Networking Industry Association, 2012  Who's paying attention – a 450-slide page count?  Not that "big": about 50
  • 3. Big Data and Hadoop  History  Data Challenges  Why Hadoop?
  • 4. Customer Challenges: The Data Deluge
  • 5. Big Data is Different than Business Intelligence
  • 6. Questions From Business Will Vary
  • 7. Web 2.0 is “Data Driven”
  • 8. The World of Data-Driven Applications
  • 9. Attributes of Big Data
  • 10. Top Ten Common Big Data Problems
  • 11. Industries Are Embracing Big Data
  • 12. Why Hadoop?
  • 13. Why Hadoop?
  • 14. Storage and Memory B/W Lagging CPU
  • 15. Commodity Hardware Economics
  • 16. What is Hadoop?  Hadoop Adoption  HDFS  MapReduce  Examples  Ecosystem Projects
  • 17. Hadoop Adoption in the Industry
  • 18. What is Hadoop?
  • 19. What is Hadoop?
  • 20. HDFS 101 – The Data Set System
  • 21. HDFS Organization and Replication
  • 22. Hadoop Server Roles - Multiple
  • 23. Hadoop Cluster
  • 24. HDFS File Write Operation - Instance
  • 25. HDFS File Read Operation - Instance
  • 26. HDFS File Operation R/W Replication
  • 27. MapReduce 101 – Functional Programming Meets Distributed Processing
  • 28. What is MapReduce?
  • 29. Key MapReduce Terminology
  • 30. MapReduce Basic Concepts
  • 31. Example 1: MapReduce Operation
  • 32. Example 2: Sample Dataset
  • 33. MapReduce Paradigm – UNIX Cmd
  • 34. Example 3: Count Words
  • 35. Ex. 3: Lifecycle of a MapReduce Job  Map function  Reduce function  Run this program as a MapReduce job
  • 36. Ex. 3: Lifecycle of a MapReduce Job  Map function  Reduce function  Run this program as a MapReduce job
  • 37. Ex. 3: Lifecycle of a MapReduce Job  Time  Input Splits  Map Wave 1  Map Wave 2  Reduce Wave 1  Reduce Wave 2  How are the number of splits, the number of map and reduce tasks, memory allocation to tasks, etc., determined?
  • 38. MapReduce Job Configuration Parms  190+ parameters in Hadoop  Set manually or defaults are used
  • 39. Putting it all Together: MapReduce + HDFS
  • 40. Hadoop Ecosystem Projects  Hive – Interactive SQL Query & Modeling  Pig – Data Flow for Tedious MapReduce Jobs  HBase – Columnar NoSQL Store
  • 41. Compare: Hadoop, SQL, Massively Parallel Processing (MPP)
  • 42. Compare: RDBMS and MapReduce
  • 43. Hadoop Use Cases  Set Top Cable TV Boxes  Pay Per View Advertising  Bank Risk Modelling  Product Sentiment Analysis
  • 44. Example 1: Set Top Cable TV Boxes
  • 45. Example 2: Pay Per View Advertising
  • 46. Example 3: Bank Risk Modelling
  • 47. Example 4: Product Sentiment Analysis
  • 48. More Reading?  World Economic Forum: “Personal Data: The Emergence of a New Asset Class” (2011)  McKinsey Global Institute: “Big Data: The Next Frontier for Innovation, Competition, and Productivity”  “Big Data: Harnessing a Game-Changing Asset”  IDC: “2011 Digital Universe Study: Extracting Value from Chaos”  The Economist: “Data, Data Everywhere”  “Data Science Revealed: A Data-Driven Glimpse into the Burgeoning New Field”  O’Reilly: “What is Data Science?”  O’Reilly: “Building Data Science Teams”  O’Reilly: “Data for the Public Good”  Obama Administration: “Big Data Research and Development Initiative”
  • 49. Introduction to Analytics and Big Data – Hadoop  Q&A  Geoff Fawkes  http://www.linkedin.com/pub/geoff-fawkes/1/269/202  @gfawkes  November, 2013