
Introduction to Big Data Technologies



This presentation gives an introduction to Big Data technologies such as Hadoop, NoSQL, Sqoop, Hive, and HBase.



  1. 1. Introduction to Big Data Technologies. Eakasit Pacharawongsakda, Ph.D. (eakasit.pac@dpu.ac.th). 1 July 2017, Walailak University, Nakhon Si Thammarat
  2. 2. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Outline • Part 1: Introduction to Big Data • Part 2: Introduction to NoSQL • Part 3: Introduction to MapReduce and Hadoop • Part 4: Introduction to Hive, HBase and Sqoop 2
  3. 3. A typical working day
  4. 4. source:http://pad1.whstatic.com/images/thumb/a/aa/Reduce-Anxiety-About-Driving-if-You're-a-Teenager-Step-5-Version-2.jpg/ aid196018-728px-Reduce-Anxiety-About-Driving-if-You're-a-Teenager-Step-5-Version-2.jpg 07:00: setting off for work
  5. 5. source: http://www.clipartkid.com/images/259/research-and-report-writing-9-23-12-9-30-12-q2r0wg-clipart.jpg 07:45: still stuck in traffic
  6. 6. 08:00: the boss calls to ask about work source: https://d1ai9qtk9p41kl.cloudfront.net/assets/mc/psuderman/2011_07/text-drive.png
  7. 7. 08:05: a collision with another car
  8. 8. 10:00: arriving at the office and carrying on with work source: http://stuffpoint.com/anime-and-manga/image/285181-anime-and-manga-girl-working-in-the-computer.jpg
  9. 9. 18:00: stopping to buy groceries on the way home
  10. 10. 20:00: arriving home, alone
  11. 11. A working day with Big Data technology
  12. 12. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Navigation systems • The Waze application 12
  13. 13. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Navigation systems • The Waze application 13
  14. 14. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Self-driving cars • Waymo (the Google self-driving car) 14
  15. 15. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Smart egg tray • Egg Minder 15
  16. 16. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Queue-free stores • Amazon Go 16
  17. 17. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Technologies that make everyday life more convenient 17
  18. 18. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Why are women still single? 18 source: https://pishetshotisak.wordpress.com/2016/12/07/ทำไมผู้หญิงถึงขึ้นคาน-ค/
  19. 19. People tend to like big things
  20. 20. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Big Data & Analytics • Big Bang 20 source:http://www.thetechy.com/science/exploring-universe-curiosity
  21. 21. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Big Data & Analytics • Big Architecture (Great wall of China) 21 source: http://www.history.com/topics/great-wall-of-china
  22. 22. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Big Data & Analytics • Big Data 22 source: http://www.plmjim.com/?p=583
  23. 23. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Data Evolutions 23 source:Data Science and Big Data Analytics: Discovering, analyzing, visualizing and presenting data
  24. 24. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU What is Big Data? 24 source: https://www.youtube.com/watch?v=TzxmjbL-i4Y
  25. 25. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU What is Big Data? 25 source: http://www.intel.com/content/www/us/en/big-data/big-data-101-animation.html#
  26. 26. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU What is Big Data? • Big Data is characterised by the 3 Vs • Volume • The amount of data grows enormously • Velocity • Data is generated very quickly • Variety • Data is increasingly varied 26 source: https://upxacademy.com/beginners-guide-to-big-data/
  27. 27. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU What is Big Data? • Huge volume of data • Data sets are extremely large, e.g., billions of rows or millions of columns 27
  28. 28. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Big Data: Volume 28 source:https://datafloq.com/read/infographic/226
  29. 29. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Big Data: Volume 29 source:https://www.adeptia.com
  30. 30. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU What is Big Data? • Huge volume of data • Data sets are extremely large, e.g., billions of rows or millions of columns • Speed of new data creation and growth • New data is created very quickly 30
  31. 31. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Big Data: Velocity 31 source: https://upxacademy.com/beginners-guide-to-big-data/
  32. 32. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU What is Big Data? • Huge volume of data • Data sets are extremely large, e.g., billions of rows or millions of columns • Speed of new data creation and growth • New data is created very quickly • Complexity of data types and structures • Data is highly varied: not only tables, but also text, images, and video clips 32
  33. 33. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Big Data: Variety 33 source: https://upxacademy.com/beginners-guide-to-big-data/
  34. 34. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Big Data: Variety 34 source: https://upxacademy.com/beginners-guide-to-big-data/
  35. 35. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU What is Big Data? 35 source: http://dataconomy.com/2014/08/infographic-how-to-explain-big-data-to-your-grandmother/
  36. 36. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Internet of Things 36 source: http://www.postscapes.com/what-exactly-is-the-internet-of-things-infographic/
  37. 37. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Sensors 37 source: http://www.postscapes.com/what-exactly-is-the-internet-of-things-infographic/
  38. 38. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU IoT applications 38
  39. 39. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU IoT applications • Disney’s Magic Band 39 source:https://disneyworld.disney.go.com/plan/my-disney-experience/bands-cards/#?CMP=SEC-WDWShareEmailNGE-MDX-MagicBand-video&video=0/0/0/0
  40. 40. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU IoT applications • GlowCaps 40 source:http://www.vitality.net/glowcaps.html
  41. 41. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU IoT applications • Connected Toothbrush 41 source:https://www.youtube.com/watch?v=gLpUxDdh9iQ
  42. 42. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU IoT applications 42 source:https://www.youtube.com/watch?v=TqRN7r7mGmk
  43. 43. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU IoT applications 43
  44. 44. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU IoT applications • iBeacon 44 source: https://www.mallmaverick.com/system/site_images/photos/000/001/700/original/blog_ibeacon1.jpg?1391033561
  45. 45. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Outline • Part 1: Introduction to Big Data • Part 2: Introduction to NoSQL • Part 3: Introduction to MapReduce and Hadoop • Part 4: Introduction to Hive, HBase and Sqoop 45
  46. 46. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Relational database & SQL • Databases are made up of tables and each table is made up of rows and columns • SQL is a database interaction language that allows you to add, retrieve, edit and delete information stored in databases 46

    ID    Mark  Code  Title
    S103  72    DBS   Database Systems
    S103  58    IAI   Intro to AI
    S104  68    PR1   Programming 1
    S104  65    IAI   Intro to AI
    S106  43    PR2   Programming 2
    S107  76    PR1   Programming 1
    S107  60    PR2   Programming 2
    S107  35    IAI   Intro to AI
  47. 47. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Relational database & SQL • SQL primarily works with two types of operations to query data • Read consists of the SELECT command, which has three common clauses • SELECT • FROM • WHERE 47 image source: https://justbablog.files.wordpress.com/2017/03/sql_beginners.jpg
  48. 48. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Relational database & SQL 48 image source: https://justbablog.files.wordpress.com/2017/03/sql_beginners.jpg
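These clauses can be tried directly against the marks table above using Python's built-in sqlite3 module; a minimal, self-contained sketch (the table and column names here are our own):

    import sqlite3

    # Build the marks table from the earlier slide in an in-memory database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE marks (id TEXT, mark INTEGER, code TEXT, title TEXT)")
    conn.executemany("INSERT INTO marks VALUES (?, ?, ?, ?)", [
        ("S103", 72, "DBS", "Database Systems"),
        ("S103", 58, "IAI", "Intro to AI"),
        ("S104", 68, "PR1", "Programming 1"),
        ("S104", 65, "IAI", "Intro to AI"),
        ("S106", 43, "PR2", "Programming 2"),
        ("S107", 76, "PR1", "Programming 1"),
        ("S107", 60, "PR2", "Programming 2"),
        ("S107", 35, "IAI", "Intro to AI"),
    ])

    # SELECT ... FROM ... WHERE: students who scored more than 60 in any module.
    for row in conn.execute("SELECT id, mark, title FROM marks WHERE mark > 60"):
        print(row)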
  49. 49. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Why NoSQL? • Relational databases have been the dominant type of database used for applications for decades. • With the advent of the Web, however, the limitations of relational databases became increasingly problematic. • Companies such as Google, LinkedIn, Yahoo! and Amazon found that supporting large numbers of users on the Web was different from supporting much smaller numbers of business users. 49
  50. 50. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Why NoSQL? 50 image source: https://www.slideshare.net/up1/introduction-to-nosql-61023856?qid=8519a104-f1d8-4955-a58b-a1eb61f61a8c
  51. 51. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Why NoSQL? • Web applications needed to support • Large volumes of read and write operations • Low-latency response times • High availability • These requirements were difficult to realise using relational databases. • There are limits to how many CPUs and how much memory can be supported in a single server. • Another option is to use multiple servers with a relational database. • Operating a single RDBMS over multiple servers is a complex operation. 51
  52. 52. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Why NoSQL? • NoSQL is “Not Only SQL” • Four characteristics of data management for large-scale data management tasks are • Scalability • Cost • Flexibility • Availability 52
  53. 53. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Why NoSQL?: Scalability • Scalability is the ability to efficiently meet the needs for varying workloads. • For example, if there is a spike in traffic to a website, additional servers can be brought online to handle the additional load. • When the spike subsides and traffic returns to normal, some of those additional servers can be shut down. • Adding servers as needed is called scaling out. 53
  54. 54. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Why NoSQL?: Scalability • Scaling Up • Scaling Out 54
  55. 55. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Why NoSQL?: Scalability • Scaling out is more flexible than scaling up. • Servers can be added or removed as needed when scaling out. • NoSQL databases are designed to utilise the servers available in a cluster with minimal intervention by database administrators. 55
  56. 56. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Why NoSQL?: Cost • Commercial software vendors employ a variety of licensing models that include charging by • the size of the server running the RDBMS • the number of concurrent users on the database • the number of named users allowed to use the software • The major NoSQL databases are available as open source, free to use on as many servers, of whatever size, as needed 56
  57. 57. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Why NoSQL?: Cost 57 image source: https://www.slideshare.net/up1/introduction-to-nosql-61023856?qid=8519a104-f1d8-4955-a58b-a1eb61f61a8c
  58. 58. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Why NoSQL?: Flexibility • Database designers expect to know at the start of a project all the tables and columns that will be needed to support an application. • It is also commonly assumed that most of the columns in a table will be needed by most of the rows. • Unlike relational databases, some NoSQL databases do not require a fixed table structure. • For example, in a document database, a program could dynamically add new attributes as needed without having to have a database designer alter the database design. 58
  59. 59. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Why NoSQL?: Availability • Many of us have come to expect websites and web applications to be available whenever we want to use them. • NoSQL databases are designed to take advantage of multiple, low-cost servers. • When one server fails or is taken out of service for maintenance, the other servers in the cluster can take on the entire workload. 59
  60. 60. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Variety of NoSQL Databases • There are four major types of NoSQL databases • Key-Value databases • Document databases • Column-oriented databases • Graph databases 60
  61. 61. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Key-Value databases • Key-value databases are the simplest form of NoSQL databases. • These databases are modelled on two components: keys and values • Data is stored in key-value pairs, where the attribute is the key and the content is the value • Data can be queried and retrieved only by its key. 61
  62. 62. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Key-Value databases • use cases • caching data from relational databases to improve performance • storing data from sensors (IoT) • software • redis • Amazon DynamoDB 62 (example pairs, keys to values: 1.accountNumber → 3876941, 1.name → Jane Washington, 1.numItems → 31, 1.custType → Loyalty Member)
  63. 63. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Key-Value databases • Redis example (http://try.redis.io) • Set or update value against a key: • SET university "DPU" // set string • GET university // get string • HSET student firstName "Manee" // Hash – set field value • HGET student firstName // Hash – get field value • LPUSH "alice:sales" "10" "20" // List create/append • LSET "alice:sales" "0" "4" // List update • LRANGE "alice:sales" 0 1 // view list 63
  64. 64. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Key-Value databases • Set or update value against a key: • SET quantities 1 • INCR quantities • SADD "alice:friends" "f1" "f2" //Set – create/update • SADD "bob:friends" "f2" "f1" //Set – create/update • Set operations: • intersection • SINTER "alice:friends" "bob:friends" • union • SUNION "alice:friends" "bob:friends" 64
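The same operations can also be issued from Python via the redis-py client; a minimal sketch, assuming a Redis server on localhost (the data is a slight variation on the slides so that intersection and union differ):

    import redis

    # decode_responses=True returns str instead of bytes
    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    r.set("university", "DPU")                  # string
    print(r.get("university"))                  # 'DPU'

    r.hset("student", "firstName", "Manee")     # hash field
    print(r.hget("student", "firstName"))       # 'Manee'

    r.sadd("alice:friends", "f1", "f2")         # sets
    r.sadd("bob:friends", "f2", "f3")
    print(r.sinter("alice:friends", "bob:friends"))  # {'f2'}
    print(r.sunion("alice:friends", "bob:friends"))  # {'f1', 'f2', 'f3'}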
  65. 65. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Variety of NoSQL Databases • There are four major types of NoSQL databases • Key-Value databases • Document databases • Column-oriented databases • Graph databases 65
  66. 66. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Document Databases • A document store allows the inserting, retrieving, and manipulating of semi-structured data. • Compared to an RDBMS, the documents themselves act as records (or rows); they are semi-structured, however, rather than bound to a rigid schema. • Documents in the same store can have different sets of data fields (columns) • Most of the databases in this category use XML or JSON 66
  67. 67. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Document Databases • Document examples 67

    { "EmployeeID": "SM1", "FirstName": "Anuj", "LastName": "Sharma",
      "Age": 45, "Salary": 10000000 }

    { "EmployeeID": "MM2", "FirstName": "Anand", "Age": 34, "Salary": 5000000,
      "Address": { "Line1": "123, 4th Street", "City": "Bangalore",
                   "State": "Karnataka" },
      "Projects": [ "nosql-migration", "top-secret-007" ] }
  68. 68. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Document Databases • Use cases • back-end support for websites with high volumes of reads and writes • applications that use JSON data structures such as twitter data • Software • MongoDB • Couchbase • IBM Cloudant 68
  69. 69. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Document Databases • MongoDB examples • Download MongoDB from https://www.mongodb.com/download-center?jmp=nav#community • MongoDB’s default data directory path is the absolute path \data\db on the drive from which you start MongoDB • You can specify an alternate path for data files using the --dbpath option to mongod.exe • Import example data 69

    "C:\Program Files\MongoDB\Server\3.4\bin\mongod.exe" --dbpath d:\test\mongodb\data
    mongoimport --db test --collection restaurants --drop --file downloads/primer-dataset.json
  70. 70. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Document Databases • MongoDB examples • Download and install Robomongo (https://robomongo.org/ download) 70
  71. 71. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Document Databases • MongoDB examples • Find bakery shops • Find restaurants on “Morris Park Ave” • Find restaurants whose zip code starts with 100 • Find bakery shops on “Morris Park Ave” 71
  72. 72. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Document Databases • MongoDB examples • Find bakery shops and show their grades • Find bakery shops and show their cuisine and grades • For more examples, please visit https://docs.mongodb.com/getting-started/shell/query/ 72
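For reference, the queries above might look as follows in Python with PyMongo, assuming a local mongod and the primer-dataset schema imported earlier (field names such as cuisine, address.street, and address.zipcode come from that dataset):

    from pymongo import MongoClient

    # assumes mongod on localhost and the imported "test.restaurants" collection
    db = MongoClient()["test"]

    # bakery shops
    bakeries = db.restaurants.find({"cuisine": "Bakery"})

    # restaurants on "Morris Park Ave"
    on_street = db.restaurants.find({"address.street": "Morris Park Ave"})

    # restaurants whose zip code starts with 100
    zip_100 = db.restaurants.find({"address.zipcode": {"$regex": "^100"}})

    # bakery shops on "Morris Park Ave", showing only cuisine and grades
    projected = db.restaurants.find(
        {"cuisine": "Bakery", "address.street": "Morris Park Ave"},
        {"cuisine": 1, "grades": 1})

    for doc in projected.limit(3):
        print(doc)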
  73. 73. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Variety of NoSQL Databases • There are four major types of NoSQL databases • Key-Value databases • Document databases • Column-oriented databases • Graph databases 73
  74. 74. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Column-oriented databases • Store data as columns, as opposed to the row-oriented storage that is prominent in RDBMSs • A relational database shows the data as two-dimensional tables comprising rows and columns, but stores, retrieves, and processes it one row at a time • A column-oriented database stores each column contiguously, i.e. on disk or in memory each column is stored in sequential blocks. 74
  75. 75. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Column-oriented databases • Example table 75image source: http://saphanatutorial.com/column-data-storage-and-row-data-storage-sap-hana/
  76. 76. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Column-oriented databases • Advantages of column-based tables: • Faster Data Access: • Only affected columns have to be read during the selection process of a query. Any of the columns can serve as an index. • Better Compression: • Columnar data storage allows highly efficient compression because most columns contain only a few distinct values (compared to the number of rows). 76
  77. 77. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Column-oriented databases • Advantages of column-based tables: • Better parallel Processing: • In a column store, data is already vertically partitioned. This means that operations on different columns can easily be processed in parallel. • If multiple columns need to be searched or aggregated, each of these operations can be assigned to a different processor core. 77
  78. 78. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Column-oriented databases • For analytic applications, where aggregations are used and fast search and processing are required, row-based storage is a poor fit. • In row-based tables, all of the data stored in a row has to be read even when only a few columns are needed. • Hence, such queries over huge amounts of data take a long time. • In columnar tables, the values of a column are stored physically next to each other, which significantly increases the speed of certain data queries. 78
  79. 79. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Column-oriented databases • Column storage is most useful for OLAP queries (queries using SQL aggregate functions), because these queries read just a few attributes from every data entry. • For traditional OLTP queries (queries not using aggregate functions), it is more advantageous to store all attributes side by side in row tables 79
  80. 80. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Column-oriented databases 80 image source: http://saphanatutorial.com/column-data-storage-and-row-data-storage-sap-hana/
  81. 81. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Column-oriented databases 81 image source: http://saphanatutorial.com/column-data-storage-and-row-data-storage-sap-hana/
  82. 82. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Column-oriented databases 82

    Operation                                                   Column-oriented   Row-oriented
    Aggregate calculation of a single column, e.g. sum(price)   Fast              Slow
    Compression                                                 Higher            -
    Retrieval of a few columns from a table with many columns   Fast              Slow
    Insertion/updating of a single new record                   Slow              Fast
    Retrieval of a single record                                Slow              Fast
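A toy illustration of the difference (our own example, not tied to any particular product): the same three records stored row-wise and column-wise, with a single-column aggregate computed against each layout:

    # Row store: a list of records; every record must be visited
    # to aggregate a single column.
    rows = [
        {"id": 1, "name": "A", "price": 10},
        {"id": 2, "name": "B", "price": 20},
        {"id": 3, "name": "C", "price": 30},
    ]
    print(sum(r["price"] for r in rows))   # 60

    # Column store: each column is one contiguous list, so the
    # aggregate scans a single sequential block of values.
    columns = {
        "id": [1, 2, 3],
        "name": ["A", "B", "C"],
        "price": [10, 20, 30],
    }
    print(sum(columns["price"]))            # 60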
  83. 83. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Column-oriented databases • Use cases • OLAP • Data Analytics • Software • Cassandra • Hbase (Hadoop) • Google BigTable • SAP HANA 83
  84. 84. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Variety of NoSQL Databases • There are four major types of NoSQL databases • Key-Value databases • Document databases • Column-oriented databases • Graph databases 84
  85. 85. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Graph databases • Graph databases are the most specialized of the four NoSQL database types. • Instead of modelling data using columns and rows, a graph database uses structures called nodes and relationships. • In more formal discussions, they are called vertices and edges • A node is an object that has an identifier and a set of attributes • A relationship is a link between two nodes that contains attributes about that relation. • Graph databases are designed to model adjacency between objects: every node in the database contains pointers to adjacent objects. • This allows for fast operations that require following paths through a graph. 85
  86. 86. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Graph databases • Example 86image source: NoSQL for Mere Mortals, Dan Sullivan, 2015
  87. 87. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Graph databases • Example 87image source: NoSQL for Mere Mortals, Dan Sullivan, 2015
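A minimal illustration of the adjacency idea, using an in-memory adjacency list rather than a real graph database (the names and the BFS helper are our own):

    from collections import deque

    # Each node points directly at its neighbours, so following a
    # relationship is a single dictionary lookup.
    graph = {
        "Alice": ["Bob", "Carol"],
        "Bob": ["Alice", "David"],
        "Carol": ["Alice"],
        "David": ["Bob"],
    }

    def shortest_path(start, goal):
        # breadth-first search over the adjacency lists
        queue, seen = deque([[start]]), {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for neighbour in graph[path[-1]]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(path + [neighbour])
        return None

    print(shortest_path("Carol", "David"))  # ['Carol', 'Alice', 'Bob', 'David']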
  88. 88. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Outline • Part 1: Introduction to Big Data • Part 2: Introduction to NoSQL • Part 3: Introduction to MapReduce and Hadoop • Part 4: Introduction to Hive, HBase and Sqoop 88
  89. 89. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop architecture • Hadoop is composed of two primary components that implement the basic concepts of distributed storage and computation: HDFS and YARN • HDFS (sometimes shortened to DFS) is the Hadoop Distributed File System, responsible for managing data stored on disks across the cluster. • YARN acts as a cluster resource manager, allocating computational assets (processing availability and memory on worker nodes) to applications that wish to perform a distributed computation. 89
  90. 90. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop architecture 90 Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
  91. 91. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop architecture • HDFS and YARN work in concert to minimize the amount of network traffic in the cluster primarily by ensuring that data is local to the required computation. • A set of machines that is running HDFS and YARN is known as a cluster, and the individual machines are called nodes. • A cluster can have a single node, or many thousands of nodes, but all clusters scale horizontally, meaning as you add more nodes, the cluster increases in both capacity and performance in a linear fashion. 91
  92. 92. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop architecture • Each node in the cluster is identified by the type of process that it runs: • Master nodes • These nodes run coordinating services for Hadoop workers and are usually the entry points for user access to the cluster. • Worker nodes • Worker nodes run services that accept tasks from master nodes either to store or retrieve data or to run a particular application. • A distributed computation is run by parallelizing the analysis across worker nodes. 92
  93. 93. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop architecture • For HDFS, the master and worker services are as follows: • NameNode (Master) • Stores the directory tree of the file system, file metadata, and the location of each file in the cluster. • Clients wanting to access HDFS must first locate the appropriate storage nodes by requesting information from the NameNode. • DataNode (Worker) • Stores and manages HDFS blocks on the local disk. • Reports health and status of individual data stores back to the NameNode 93
  94. 94. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop architecture • An HDFS cluster with a replication factor of two; the NameNode contains the mapping of files to blocks, and the DataNodes store the blocks and their replicas 94 Image source: "Hadoop with Python", Zachary Radtka and Donald Miner, 2016
  95. 95. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop architecture • When data is accessed from HDFS • a client application must first make a request to the NameNode to locate the data on disk. • The NameNode will reply with a list of DataNodes that store the data. • the client must then directly request each block of data from the DataNode. 95
  96. 96. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop architecture • YARN has multiple master services and a worker service as follows: • ResourceManager (Master) • Allocates and monitors available cluster resources (e.g., physical assets like memory and processor cores) • handling scheduling of jobs on the cluster • ApplicationMaster (Master) • Coordinates a particular application being run on the cluster as scheduled by the ResourceManager 96
  97. 97. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop architecture • YARN has multiple master services and a worker service as follows: • NodeManager (Worker) • Runs and manages processing tasks on an individual node as well as reports the health and status of tasks as they’re running 97
  98. 98. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop architecture • A small Hadoop cluster with two master nodes and four worker nodes that implements all six primary Hadoop services 98 Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
  99. 99. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop architecture • Clients that wish to execute a job • must first request resources from the ResourceManager, which assigns an application-specific ApplicationMaster for the duration of the job. • the ApplicationMaster tracks the execution of the job. • the ResourceManager tracks the status of the nodes • each individual NodeManager creates containers and executes tasks within them 99
  100. 100. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop architecture • Finally, one other type of cluster is important to note: a single node cluster. • In “pseudo-distributed mode” a single machine runs all Hadoop daemons as though it were part of a cluster, but network traffic occurs through the local loopback network interface. • Hadoop developers typically work in a pseudo-distributed environment, usually inside of a virtual machine to which they connect via SSH. • Cloudera, Hortonworks, and other popular distributions of Hadoop provide pre-built virtual machine images that you can download and get started with right away. 100
  101. 101. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop Distributed File System (HDFS) • HDFS provides redundant storage for big data by storing that data across a cluster of cheap, unreliable computers, thus extending the amount of available storage capacity that a single machine alone might have. • HDFS performs best with a modest number of very large files • millions of large files (100 MB or more) rather than billions of smaller files that might occupy the same volume. • It is not a good fit as a data backend for applications that require real-time updates, interactive data analysis, or record-based transactional support. 101
  102. 102. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop Distributed File System (HDFS) • HDFS files are split into blocks, usually of either 64MB or 128MB. • Blocks allow very large files to be split across and distributed to many machines at run time. • Additionally, blocks are replicated across the DataNodes. • by default, the replication is threefold • Therefore, each block exists on three different machines and three different disks, and even if two nodes fail, the data will not be lost. 102
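To make the arithmetic concrete, a quick sketch with assumed sizes (a 1 GB file, 128 MB blocks, replication factor 3):

    # How many block copies does one file produce?
    file_size_mb, block_size_mb, replication = 1024, 128, 3
    blocks = -(-file_size_mb // block_size_mb)   # ceiling division -> 8 blocks
    print(blocks, blocks * replication)          # 8 blocks, 24 stored copies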
  103. 103. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Interacting with HDFS • Interacting with HDFS is primarily performed from the command line using the hadoop fs command (or the equivalent hdfs dfs script), which has the following usage: • The -option argument is the name of a specific option for the specified command, and <arg> is one or more arguments that are specified for this option. • For example, show help 103 $ hadoop fs [-option <arg>] $ hadoop fs -help
  104. 104. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Interacting with HDFS • List directory contents • use -ls command: • Running the -ls command on a new cluster will not return any results. This is because the -ls command, without any arguments, will attempt to display the contents of the user’s home directory on HDFS. • Providing -ls with the forward slash (/) as an argument displays the contents of the root of HDFS: 104 $ hadoop fs -ls $ hadoop fs -ls /
  105. 105. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Interacting with HDFS • Creating a directory • To create the books directory within HDFS, use the -mkdir command: • For example, create books directory in home directory • Use the -ls command to verify that the previous directories were created: 105 $ hadoop fs -mkdir [directory name] $ hadoop fs -mkdir books $ hadoop fs -ls
  106. 106. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Interacting with HDFS • Copy Data onto HDFS • After a directory has been created for the current user, data can be uploaded to the user’s HDFS home directory with the -put command: • For example, copy book file from local to HDFS • Use the -ls command to verify that pg20417.txt was moved to HDFS: 106 $ hadoop fs -put [source file] [destination file] $ hadoop fs -put pg20417.txt books/pg20417.txt $ hadoop fs -ls books
  107. 107. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Interacting with HDFS • Retrieve (view) Data from HDFS • Multiple commands allow data to be retrieved from HDFS. • To simply view the contents of a file, use the -cat command. -cat reads a file on HDFS and displays its contents to stdout. • The following command uses -cat to display the contents of pg20417.txt 107 $ hadoop fs -cat books/pg20417.txt
  108. 108. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Interacting with HDFS • Retrieve (view) Data from HDFS • Data can also be copied from HDFS to the local filesystem using the -get command. The -get command is the opposite of the -put command: • For example, this command copies pg20417.txt from HDFS to the local filesystem. 108 $ hadoop fs -get [source file] [destination file] $ hadoop fs -get books/pg20417.txt .
  109. 109. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce • MapReduce is a programming model that enables large volumes of data to be processed and generated by dividing work into independent tasks and executing the tasks in parallel across a cluster of machines. • At a high level, every MapReduce program transforms a list of input data elements into a list of output data elements twice, once in the map phase and once in the reduce phase. • The MapReduce framework is composed of three major phases: map, shuffle and sort, and reduce. 109 Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
  110. 110. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce • Map • The first phase of a MapReduce application is the map phase. Within the map phase, a function (called the mapper) processes a series of key-value pairs. • The mapper sequentially processes each key-value pair individually, producing zero or more output key-value pairs • As an example, consider a mapper whose purpose is to transform sentences into words. 110
  111. 111. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce • Map • The input to this mapper would be strings that contain sentences, and the mapper’s function would be to split the sentences into words and output the words 111 Image source: "Hadoop with Python", Zachary Radtka and Donald Miner, 2016
  112. 112. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce • Shuffle and Sort • As the mappers begin completing, the intermediate outputs from the map phase are moved to the reducers. This process of moving output from the mappers to the reducers is known as shuffling. • Shuffling is handled by a partition function, known as the partitioner. The partitioner ensures that all of the values for the same key are sent to the same reducer. • The intermediate keys and values for each partition are sorted by the Hadoop framework before being presented to the reducer. 112
  113. 113. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce • Reduce • Within the reducer phase, an iterator of values is provided to a function known as the reducer. The iterator of values is a nonunique set of values for each unique key from the output of the map phase. • The reducer aggregates the values for each unique key and produces zero or more output key-value pairs • As an example, consider a reducer whose purpose is to sum all of the values for a key. The input to this reducer is an iterator of all of the values for a key, and the reducer sums all of the values. 113
  114. 114. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce • Reduce • The reducer then outputs a key-value pair that contains the input key and the sum of the input key values 114 Image source: "Hadoop with Python", Zachary Radtka and Donald Miner, 2016
  115. 115. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce • Data flow of a MapReduce job being executed on a cluster of a few nodes 115 Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
  116. 116. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples • In order to demonstrate how data flows through a map and reduce computational pipeline, we will present 3 examples • word counting • IoT data • shared friendship 116
  117. 117. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • The word-counting application takes as input one or more text files and produces a list of word and their frequencies as output. 117 Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
  118. 118. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Because Hadoop utilizes key/value pairs, the input key is a file ID and line number and the input value is a string, while the output key is a word and the output value is an integer. • The following Python pseudocode shows how this algorithm is implemented: 118

    # emit is a function that performs Hadoop I/O
    def map(dockey, line):
        for word in line.split():
            emit(word, 1)

    def reduce(word, values):
        count = sum(value for value in values)
        emit(word, count)
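The whole pipeline can be simulated end to end in plain Python; this sketch models the data flow only, not Hadoop itself, and simplifies tokenisation to whitespace splitting (so "hat." keeps its trailing period):

    from collections import defaultdict

    def map_fn(dockey, line):
        # emit (word, 1) for every word in the line
        for word in line.split():
            yield (word, 1)

    def reduce_fn(word, values):
        yield (word, sum(values))

    def run_job(records):
        # map phase
        intermediate = []
        for key, line in records:
            intermediate.extend(map_fn(key, line))
        # shuffle & sort: group all values by key
        groups = defaultdict(list)
        for word, count in intermediate:
            groups[word].append(count)
        # reduce phase
        results = []
        for word in sorted(groups):
            results.extend(reduce_fn(word, groups[word]))
        return results

    print(run_job([(27183, "The fast cat wears no hat."),
                   (31416, "The cat in the hat ran fast.")]))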
  119. 119. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU • Example (Map) MapReduce examples: word count 119 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”) input Mapper 1 Mapper 2
  120. 120. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU • Example (Map) MapReduce examples: word count 120 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”) input Mapper 1 Mapper 2 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”)
  121. 121. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU • Example (Map) MapReduce examples: word count 121 (“The”,1) (“The”,1) input Mapper 1 Mapper 2 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”)
  122. 122. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU • Example (Map) MapReduce examples: word count 122 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”) (“The”,1) (“The”,1)(“fast”,1) (“cat”,1) input Mapper 1 Mapper 2
  123. 123. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU • Example (Map) MapReduce examples: word count 123 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”) (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) input Mapper 1 Mapper 2
  124. 124. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU • Example (Map) MapReduce examples: word count 124 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”) (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1) input Mapper 1 Mapper 2
  125. 125. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU • Example (Map) MapReduce examples: word count 125 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”) (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1) input Mapper 1 Mapper 2
  126. 126. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU • Example (Map) MapReduce examples: word count 126 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”) (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“ran”,1) input Mapper 1 Mapper 2
  127. 127. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU • Example (Map) MapReduce examples: word count 127 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”) (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) input Mapper 1 Mapper 2
  128. 128. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU • Example (Map) MapReduce examples: word count 128 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”) (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) input Mapper 1 Mapper 2
  129. 129. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU • Example (Shuffle & Sort) MapReduce examples: word count 129 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) Mapper 1 Mapper 2 Shuffle & Sort
  130. 130. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Shuffle & Sort) 130 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) Mapper 1 Mapper 2 Shuffle & Sort
  131. 131. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Shuffle & Sort) 131 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) Mapper 1 Mapper 2 Shuffle & Sort
  132. 132. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Shuffle & Sort) 132 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) Mapper 1 Mapper 2 Shuffle & Sort
  133. 133. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Shuffle & Sort) 133 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) Mapper 1 Mapper 2 Shuffle & Sort
  134. 134. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Shuffle & Sort) 134 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) Mapper 1 Mapper 2 Shuffle & Sort
  135. 135. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Shuffle & Sort) 135 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) Mapper 1 Mapper 2 Shuffle & Sort
  136. 136. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Shuffle & Sort) 136 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) Mapper 1 Mapper 2 Shuffle & Sort
  137. 137. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Shuffle & Sort) 137 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) Mapper 1 Mapper 2 Shuffle & Sort
  138. 138. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Shuffle & Sort) 138 Mapper 1 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Mapper 2 Shuffle & Sort
  139. 139. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Reduce) 139 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2)
  140. 140. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Reduce) 140 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2) (“cat”,2)
  141. 141. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Reduce) 141 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2) (“cat”,2) (“fast”,2)
  142. 142. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Reduce) 142 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2) (“cat”,2) (“fast”,2) (“hat”,2)
  143. 143. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Reduce) 143 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2) (“cat”,2) (“fast”,2) (“hat”,2) (“in”,1)
  144. 144. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Reduce) 144 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2) (“cat”,2) (“fast”,2) (“hat”,2) (“in”,1) (“no”,1)
  145. 145. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Reduce) 145 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2) (“cat”,2) (“fast”,2) (“hat”,2) (“in”,1) (“no”,1) (“ran”,1)
  146. 146. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Reduce) 146 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2) (“cat”,2) (“fast”,2) (“hat”,2) (“in”,1) (“no”,1) (“ran”,1) (“the”,1)
  147. 147. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Reduce) 147 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2) (“cat”,2) (“fast”,2) (“hat”,2) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1)
  148. 148. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: word count • Example (Reduce) 148 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2) (“cat”,2) (“fast”,2) (“hat”,2) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,2)
  149. 149. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples • In order to demonstrate how data flows through a map and reduce computational pipeline, we will present 3 examples • word counting • IoT data • shared friendship 149
  150. 150. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: IoT • IoT applications create an enormous amount of data that has to be processed. This data is generated by physical sensors that take measurements, such as the room temperature at 08:00. • Every measurement consists of • a key (the timestamp at which the measurement was taken) and • a value (the actual value measured by the sensor). • for example, (2016-05-01 01:02:03, 1). • The goal of this exercise is to compute average daily values from that sensor’s data. 150
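Following the style of the earlier word-count pseudocode, the mapper and reducer for this job could be sketched as follows (our own sketch; the slides themselves only walk through the data flow):

    # emit performs Hadoop I/O, as in the word-count example
    def map(timestamp, value):
        day = timestamp.split(" ")[0]   # "2016-05-01 01:02:03" -> "2016-05-01"
        emit(day, value)

    def reduce(day, values):
        values = list(values)
        emit(day, sum(values) / len(values))   # daily average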
  151. 151. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU • Example(Map) MapReduce examples: IoT 151 input Mapper 1 Mapper 2 Mapper 3 (“2016-05-01 01:02:03”,1) (“2016-05-02 12:09:04”,2) (“2016-05-03 09:21:07”,3) (“2016-05-03 09:21:45”,4) (“2016-05-01 01:02:04”,5) (“2016-05-02 12:09:01”,6) (“2016-05-02 12:09:30”,7) (“2016-05-03 09:21:31”,8) (“2016-05-01 01:02:05”,9)
  152. 152. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU • Example(Map) MapReduce examples: IoT 152 input Mapper 1 Mapper 2 Mapper 3 (“2016-05-01 01:02:03”,1) (“2016-05-02 12:09:04”,2) (“2016-05-03 09:21:07”,3) (“2016-05-03 09:21:45”,4) (“2016-05-01 01:02:04”,5) (“2016-05-02 12:09:01”,6) (“2016-05-02 12:09:30”,7) (“2016-05-03 09:21:31”,8) (“2016-05-01 01:02:05”,9) (“2016-05-01 01:02:03”,1) (“2016-05-02 12:09:04”,2) (“2016-05-03 09:21:07”,3) (“2016-05-03 09:21:45”,4) (“2016-05-01 01:02:04”,5) (“2016-05-02 12:09:01”,6) (“2016-05-02 12:09:30”,7) (“2016-05-03 09:21:31”,8) (“2016-05-01 01:02:05”,9)
  153. 153. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU • Example(Map) MapReduce examples: IoT 153 input Mapper 1 Mapper 2 Mapper 3 (“2016-05-01 01:02:03”,1) (“2016-05-02 12:09:04”,2) (“2016-05-03 09:21:07”,3) (“2016-05-03 09:21:45”,4) (“2016-05-01 01:02:04”,5) (“2016-05-02 12:09:01”,6) (“2016-05-02 12:09:30”,7) (“2016-05-03 09:21:31”,8) (“2016-05-01 01:02:05”,9) (“2016-05-01”,1) (“2016-05-02”,2) (“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-01”,5) (“2016-05-02”,6) (“2016-05-02”,7) (“2016-05-03”,8) (“2016-05-01”,9)
  154. 154. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: IoT • Example(Shuffle & Sort) 154 Mapper 1 Mapper 2 Mapper 3 (“2016-05-01”,1) (“2016-05-02”,2) (“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-01”,5) (“2016-05-02”,6) (“2016-05-02”,7) (“2016-05-03”,8) (“2016-05-01”,9) (“2016-05-01”,1) (“2016-05-01”,5) (“2016-05-01”,9) Shuffle & Sort
  155. 155. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: IoT • Example(Shuffle & Sort) 155 Mapper 1 Mapper 2 Mapper 3 (“2016-05-01”,1) (“2016-05-02”,2) (“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-01”,5) (“2016-05-02”,6) (“2016-05-02”,7) (“2016-05-03”,8) (“2016-05-01”,9) (“2016-05-01”,1) (“2016-05-01”,5) (“2016-05-01”,9) (“2016-05-02”,2) (“2016-05-02”,6) (“2016-05-02”,7) Shuffle & Sort
  156. 156. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: IoT • Example(Shuffle & Sort) 156 Mapper 1 Mapper 2 Mapper 3 (“2016-05-01”,1) (“2016-05-02”,2) (“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-01”,5) (“2016-05-02”,6) (“2016-05-02”,7) (“2016-05-03”,8) (“2016-05-01”,9) (“2016-05-01”,1) (“2016-05-01”,5) (“2016-05-01”,9) (“2016-05-02”,2) (“2016-05-02”,6) (“2016-05-02”,7) (“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-03”,8) Shuffle & Sort
  157. 157. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: IoT • Example(Reduce) 157 (“2016-05-01”,1) (“2016-05-01”,5) (“2016-05-01”,9) (“2016-05-02”,2) (“2016-05-02”,6) (“2016-05-02”,7) (“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-03”,8) Shuffle & Sort (“2016-05-01”,5)value = (1+5+9)/3 Reduce
  158. 158. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: IoT • Example(Reduce) 158 (“2016-05-01”,1) (“2016-05-01”,5) (“2016-05-01”,9) (“2016-05-02”,2) (“2016-05-02”,6) (“2016-05-02”,7) (“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-03”,8) Shuffle & Sort Reduce (“2016-05-01”,5) value = (2+6+7)/3 (“2016-05-02”,5)
  159. 159. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: IoT • Example(Reduce) 159 (“2016-05-01”,1) (“2016-05-01”,5) (“2016-05-01”,9) (“2016-05-02”,2) (“2016-05-02”,6) (“2016-05-02”,7) (“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-03”,8) Shuffle & Sort (“2016-05-01”,5) value = (3+4+8)/3 (“2016-05-02”,5) (“2016-05-03”,5) Reduce
  160. 160. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples • In order to demonstrate how data flows through a map and reduce computational pipeline, we will present 3 examples • word counting • IoT data • shared friendship 160
  161. 161. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: shared friendship • In the shared friendship task, the goal is to analyze a social network to see which friend relationships users have in common. • Given an input data source where the key is the name of a user and the value is a comma-separated list of friends. 161
  162. 162. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: shared friendship • The following Python pseudocode demonstrates how to perform this computation: 162

    def map(person, friends):
        for friend in friends.split(","):
            pair = sorted([person, friend])
            emit(pair, friends)

    def reduce(pair, friends_lists):
        # friends_lists holds the two friend lists emitted for this pair
        first, second = [set(f.split(",")) for f in friends_lists]
        emit(pair, first.intersection(second))
  163. 163. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: shared friendship • The mapper creates an intermediate key for every possible (friend, friend) pair that exists in the initial dataset. • This allows us to analyze the dataset on a per-relationship basis, as the value is the list of associated friends. • The pair is sorted, which ensures that the inputs (“Mike”, “Linda”) and (“Linda”, “Mike”) end up being the same key during aggregation in the reducer. 163
  164. 164. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU • Example(Map) MapReduce examples: shared friendship 164 input (“Allen”,”Betty, Chris, David”) (“Betty”,”Allen, Chris, David, Ellen”) (“Chris”,”Allen, Betty, David,Ellen”) (“David”,”Allen, Betty, Chris, Ellen”) (“Ellen”,”Betty, Chris, David”)
  165. 165. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU • Example(Map) MapReduce examples: shared friendship 165 input Mapper 1 Mapper 2 (“Allen”,”Betty, Chris, David”) (“Betty”,”Allen, Chris, David, Ellen”) (“Chris”,”Allen, Betty, David,Ellen”) (“David”,”Allen, Betty, Chris, Ellen”) (“Ellen”,”Betty, Chris, David”) (“Allen, Betty”,”Betty, Chris, David”) (“Allen, Chris”,”Betty, Chris, David”) (“Allen, David”,”Betty, Chris, David”) (“Betty, Ellen”,”Betty, Chris, David”) (“Chris, Ellen”,”Betty, Chris, David”) (“David, Ellen”,”Betty, Chris, David”)
  166. 166. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU • Example(Map) MapReduce examples: shared friendship 166 input Mapper 3 Mapper 4 (“Allen”,”Betty, Chris, David”) (“Betty”,”Allen, Chris, David, Ellen”) (“Chris”,”Allen, Betty, David,Ellen”) (“David”,”Allen, Betty, Chris, Ellen”) (“Ellen”,”Betty, Chris, David”) (“Allen, Betty”,”Allen, Chris, David, Ellen”) (“Betty, Chris”,”Allen, Chris, David, Ellen”) (“Betty, David”,”Allen, Chris, David, Ellen”) (“Betty, Ellen”,”Allen, Chris, David, Ellen”) (“Allen, Chris”,”Allen, Betty, David,Ellen”) (“Betty, Chris”,”Allen, Betty, David,Ellen”) (“Chris, David”,”Allen, Betty, David,Ellen”) (“Chris, Ellen”,”Allen, Betty, David,Ellen”)
  167. 167. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU • Example(Map) MapReduce examples: shared friendship 167 input Mapper 5 (“Allen”,”Betty, Chris, David”) (“Betty”,”Allen, Chris, David, Ellen”) (“Chris”,”Allen, Betty, David,Ellen”) (“David”,”Allen, Betty, Chris, Ellen”) (“Ellen”,”Betty, Chris, David”) (“Allen, David”,”Allen, Betty, Chris, Ellen”) (“Betty, David”,”Allen, Betty, Chris, Ellen”) (“Chris, David”,”Allen, Betty, Chris, Ellen”) (“David, Ellen”,”Allen, Betty, Chris, Ellen”)
  168. 168. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: shared friendship • Example (Shuffle & Sort) 168 Shuffle & Sort (“Allen, Betty”,”Betty, Chris, David”)(“Allen, Betty”,”Allen, Chris, David, Ellen”) (“Allen, Chris”,”Allen, Betty, David,Ellen”) (“Allen, Chris”,”Betty, Chris, David”) (“Allen, David”,”Allen, Betty, Chris, Ellen”) (“Allen, David”,”Betty, Chris, David”) (“Betty, Chris”,”Allen, Betty, David,Ellen”) (“Betty, Chris”,”Allen, Chris, David, Ellen”) (“Betty, David”,”Allen, Chris, David, Ellen”)(“Betty, David”,”Allen, Betty, Chris, Ellen”) (“Betty, Ellen”,”Allen, Chris, David, Ellen”) (“Betty, Ellen”,”Betty, Chris, David”) (“Chris, David”,”Allen, Betty, David,Ellen”)(“Chris, David”,”Allen, Betty, Chris, Ellen”) (“Chris, Ellen”,”Betty, Chris, David”)(“Chris, Ellen”,”Allen, Betty, David,Ellen”) (“David, Ellen”,”Allen, Betty, Chris, Ellen”) (“David, Ellen”,”Betty, Chris, David”)
  170. 170. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU MapReduce examples: shared friendship • Example (Reduce) 170 (“Allen, Betty”, “Chris, David”) (“Allen, Chris”, “Betty, David”) (“Allen, David”, “Betty, Chris”) (“Betty, Chris”, “Allen, David, Ellen”) (“Betty, David”, “Allen, Chris, Ellen”) (“Betty, Ellen”, “Chris, David”) (“Chris, David”, “Allen, Betty, Ellen”) (“Chris, Ellen”, “Betty, David”) (“David, Ellen”, “Betty, Chris”)
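A matching sketch of the reduce step (again illustrative; it assumes the shuffle delivers each sorted pair together with the friend lists of both members):

def reduce_shared_friends(pair, friend_lists):
    # pair         -- e.g. ("Allen", "Betty")
    # friend_lists -- e.g. [["Betty", "Chris", "David"],
    #                       ["Allen", "Chris", "David", "Ellen"]]
    shared = set(friend_lists[0])
    for friends in friend_lists[1:]:
        shared &= set(friends)   # keep only mutual friends
    yield pair, sorted(shared)

Intersecting ["Betty", "Chris", "David"] with ["Allen", "Chris", "David", "Ellen"] yields ["Chris", "David"], matching the ("Allen, Betty", "Chris, David") output above.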
  171. 171. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop Streaming • Hadoop streaming is a utility that comes packaged with the Hadoop distribution and allows MapReduce jobs to be created with any executable as the mapper and/or the reducer. • The Hadoop streaming utility enables Python, shell scripts, or any other language to be used as a mapper, reducer, or both. • The mapper and reducer are both executables that • read input, line by line, from the standard input (stdin), • and write output to the standard output (stdout). • The Hadoop streaming utility creates a MapReduce job, submits the job to the cluster, and monitors its progress until it is complete. 171
  172. 172. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop Streaming • When the mapper is initialized, each map task launches the specified executable as a separate process. • The mapper reads the input file and presents each line to the executable via stdin. After the executable processes each line of input, the mapper collects the output from stdout and converts each line to a key-value pair. • The key consists of the part of the line before the first tab character, and the value consists of the part of the line after the first tab character. 172
  173. 173. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop Streaming • When the reducer is initialized, each reduce task launches the specified executable as a separate process. • The reducer converts the input key-value pairs to lines that are presented to the executable via stdin. • The reducer collects the executable’s results from stdout and converts each line to a key-value pair. • Similar to the mapper, the executable specifies key-value pairs by separating the key and value with a tab character. 173
  174. 174. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop Streaming • Data flow in Hadoop Streaming via Python mapper.py and reducer.py scripts 174 Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
  175. 175. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop Streaming example • The WordCount application can be implemented as two Python programs: mapper.py and reducer.py. • mapper.py is the Python program that implements the logic in the map phase of WordCount. • It reads data from stdin, splits the lines into words, and outputs each word with its intermediate count to stdout. 175
  176. 176. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop Streaming example • mapper.py 176

#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
  177. 177. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop Streaming example • reducer.py is the Python program that implements the logic in the reduce phase of WordCount. • It reads the results of mapper.py from stdin, sums the occurrences of each word, and writes the result to stdout. • reducer.py 177

#!/usr/bin/env python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None
  178. 178. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop Streaming example • reducer.py (cont’) 178

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
  179. 179. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop Streaming example • reducer.py (cont’) 179

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
  180. 180. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop Streaming example • Before attempting to execute the code, ensure that the mapper.py and reducer.py files have execution permission. • The following command will enable this for both files: • Also ensure that the first line of each file contains the proper path to Python. This line enables mapper.py and reducer.py to execute as standalone executables. • It is highly recommended to test all programs locally before running them across a Hadoop cluster. 180

$ chmod +x mapper.py reducer.py
$ echo 'The fast cat wears no hat' | ./mapper.py | sort -k1,1 | ./reducer.py
  181. 181. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop Streaming example • Download 3 ebooks from Project Gutenberg • The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson (659 KB) • The Notebooks of Leonardo Da Vinci (1.4 MB) • Ulysses by James Joyce (1.5 MB) • Before we run the actual MapReduce job, we must first copy the files from our local file system to Hadoop’s HDFS. 181 
$ hadoop fs -put pg20417.txt books/pg20417.txt
$ hadoop fs -put 5000-8.txt books/5000-8.txt
$ hadoop fs -put 4300-0.txt books/4300-0.txt
$ hadoop fs -ls books
  182. 182. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop Streaming example • The mapper and reducer programs can be run as a MapReduce application using the Hadoop streaming utility. • The command to run the Python programs mapper.py and reducer.py on a Hadoop cluster is as follows: 182 
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/hduser/books/* \
    -output /user/hduser/books/output
  183. 183. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hadoop Streaming example • Options for Hadoop streaming 183

Option    Description
-files    A comma-separated list of files to be copied to the MapReduce cluster
-mapper   The command to be run as the mapper
-reducer  The command to be run as the reducer
-input    The DFS input path for the Map step
-output   The DFS output directory for the Reduce step
  184. 184. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Python MapReduce library: mrjob • mrjob is a Python MapReduce library, created by Yelp, that wraps Hadoop streaming, allowing MapReduce applications to be written in a more Pythonic manner. • mrjob enables multistep MapReduce jobs to be written in pure Python. • MapReduce jobs written with mrjob can be tested locally, run on a Hadoop cluster, or run in the cloud using Amazon Elastic MapReduce (EMR). 184
  185. 185. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Python MapReduce library: mrjob • Installation • First, install the Python package manager pip on the CDH VM. • The installation of mrjob is simple; it can be installed with pip by using the following command: 185

$ yum -y install python-pip
$ pip install mrjob
  186. 186. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU mrjob example • word_count.py • To run the job locally and count the frequency of words within a file named pg20417.txt, use the following command: 186

from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        for word in line.split():
            yield (word, 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordCount.run()

$ python word_count.py books/pg20417.txt
  187. 187. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU mrjob example • The MapReduce job is defined as the class, MRWordCount. Within the mrjob library, the class that inherits from MRJob contains the methods that define the steps of the MapReduce job. • The steps within an mrjob application are mapper, combiner, and reducer. The class inheriting MRJob only needs to define one of these steps. • The mapper() method defines the mapper for the MapReduce job. It takes key and value as arguments and yields tuples of (output_key, output_value). • In the WordCount example, the mapper ignored the input key and split the input value to produce words and counts. 187
  188. 188. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU mrjob example • The combiner is a process that runs after the mapper and before the reducer. • It receives, as input, all of the data emitted by the mapper, and its output is sent to the reducer. The combiner yields tuples of (output_key, output_value) as output. • The reducer() method defines the reducer for the MapReduce job. • It takes a key and an iterator of values as arguments and yields tuples of (output_key, output_value). • In the example, the reducer sums the values for each key, which represent the frequencies of words in the input. A combiner variant is sketched below. 188
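As a hedged illustration, the WordCount job above could add a combiner like this (same mrjob API as the earlier example; the class name is hypothetical):

from mrjob.job import MRJob

class MRWordCountWithCombiner(MRJob):
    def mapper(self, _, line):
        for word in line.split():
            yield (word, 1)

    def combiner(self, word, counts):
        # Pre-aggregate counts on the map side to reduce shuffle traffic.
        yield (word, sum(counts))

    def reducer(self, word, counts):
        # Sum the partial sums produced by the combiners.
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordCountWithCombiner.run()

Because addition is associative and commutative, the reducer produces the same totals whether or not the combiner runs.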
  189. 189. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU mrjob example • The final component of a MapReduce job written with the mrjob library is the two lines at the end of the file: if __name__ == '__main__': MRWordCount.run() • These lines enable the execution of mrjob; without them, the application will not work. • Executing a MapReduce application with mrjob is similar to executing any other Python program. The command line must contain the name of the mrjob application and the input file: 189 $ python mr_job.py input.txt
  190. 190. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU mrjob example • By default, mrjob runs locally, allowing code to be developed and debugged before being submitted to a Hadoop cluster. • To change how the job is run, specify the -r/--runner option. 190 $ python word_count.py -r hadoop hdfs:books/pg20417.txt
  191. 191. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Outline • Part 1: Introduction to Big Data • Part 2: Introduction to NoSQL • Part 3: Introduction to MapReduce and Hadoop • Part 4: Introduction to Hive, HBase and Sqoop 191
  192. 192. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Introduction • The Hadoop ecosystem emerged as a cost-effective way of working with large datasets • It imposes a particular programming model, called MapReduce, for breaking up computation tasks into units that can be distributed around a cluster of commodity servers • Underneath this computation model is a distributed file system called the Hadoop Distributed Filesystem (HDFS) • However, a challenge remains: how do you move an existing data infrastructure to Hadoop, when that infrastructure is based on traditional relational databases and the Structured Query Language (SQL)? 192
  193. 193. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Introduction • This is where Hive comes in. Hive provides an SQL dialect, called Hive Query Language (abbreviated HiveQL or just HQL) for querying data stored in a Hadoop cluster. • SQL knowledge is widespread for a reason; it’s an effective, reasonably intuitive model for organizing and using data. • Mapping these familiar data operations to the low-level MapReduce Java API can be daunting, even for experienced Java developers. • Hive does this dirty work for you, so you can focus on the query itself. Hive translates most queries to MapReduce jobs, thereby exploiting the scalability of Hadoop, while presenting a familiar SQL abstraction. 193
  194. 194. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Introduction • Hive is best suited for data warehouse applications, where relatively static data is analyzed, fast response times are not required, and the data is not changing rapidly. • Apache Hive is a “data warehousing” framework built on top of Hadoop. • Hive provides data analysts with a familiar SQL-based interface to Hadoop, which allows them to attach structured schemas to data in HDFS and access and analyze that data using SQL queries. • Hive has made it possible for developers who are fluent in SQL to leverage the scalability and resilience of Hadoop without requiring them to learn Java or the native MapReduce API. 194
  195. 195. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hive in the Hadoop Ecosystem • Hive modules 195 Image source: “Programming Hive: Data Warehouse and Query Language for Hadoop”, Edward Capriolo, Dean Wampler and Jason Rutherglen, 2012
  196. 196. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hive in the Hadoop Ecosystem • There are several ways to interact with Hive • CLI: command-line interface • GUI: Graphic User Interface • Karmasphere (http://karmasphere.com) • Cloudera’s open source Hue (https://github.com/cloudera/hue) • All commands and queries go to the Driver, which compiles the input, optimizes the computation required, and executes the required steps, usually with MapReduce jobs. 196
  197. 197. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Hive in the Hadoop Ecosystem • Hive communicates with the JobTracker to initiate the MapReduce job. • Hive does not have to be running on the same master node with the JobTracker. In larger clusters, it’s common to have edge nodes where tools like Hive run. • They communicate remotely with the JobTracker on the master node to execute jobs. Usually, the data files to be processed are in HDFS, which is managed by the NameNode. • The Metastore is a separate relational database (usually a MySQL instance) where Hive persists table schemas and other system metadata. 197
  198. 198. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Structured Data Queries with Hive • Hive provides its own dialect of SQL called the Hive Query Language, or HQL. • HQL supports many commonly used SQL statements, including data definition statements (DDLs) (e.g., CREATE DATABASE/ SCHEMA/ TABLE), data manipulation statements (DMSs) (e.g., INSERT, UPDATE, LOAD), and data retrieval queries (e.g., SELECT). • Hive commands and HQL queries are compiled into an execution plan or a series of HDFS operations and/ or MapReduce jobs, which are then executed on a Hadoop cluster. 198
  199. 199. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Structured Data Queries with Hive • Additionally, Hive queries entail higher latency due to the overhead required to generate and launch the compiled MapReduce jobs on the cluster; even small queries that would complete within a few seconds on a traditional RDBMS may take several minutes to finish in Hive. • On the plus side, Hive provides the high scalability and high throughput that you would expect from any Hadoop-based application. • It is very well suited to batch-level workloads for online analytical processing (OLAP) of very large datasets at the terabyte and petabyte scale. 199
  200. 200. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU The Hive Command-Line Interface (CLI) • Hive’s installation comes packaged with a handy command-line interface (CLI), which we will use to interact with Hive and run our HQL statements. • Running the hive command initiates the CLI, bootstraps the logger (if configured) and the Hive history file, and finally displays a Hive CLI prompt: • You can view the full list of Hive options for the CLI by using the -H flag: 200

$ hive
hive>

$ hive -H
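Besides the CLI, HQL statements can also be submitted from Python. Below is a minimal sketch using the third-party PyHive library (an assumption on my part, not part of these slides or of the Hive distribution), against a HiveServer2 instance assumed to listen on localhost:10000:

from pyhive import hive

# Connection parameters are assumptions for a local sandbox setup.
conn = hive.Connection(host='localhost', port=10000, username='cloudera')
cursor = conn.cursor()
cursor.execute('SHOW DATABASES')
for (database,) in cursor.fetchall():
    print database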
  201. 201. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU HUE: Apache Hadoop UI • HUE (Hadoop User Experience) is a Web interface for analyzing data with Apache Hadoop. • Go to quickstart.cloudera:8888/about • username: cloudera • password: cloudera 201
  202. 202. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Query Editors • Click Query Editors then Hive 202
  203. 203. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Example: web logs database • Choose default database • HQL: SELECT * FROM web_logs 203
  204. 204. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Example: web logs database • HQL: SELECT web_logs.country_name, count(1) AS count
 FROM web_logs 
 GROUP BY country_name 204
  205. 205. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Creating a database • Creating a database in Hive is very similar to creating a database in a SQL-based RDBMS, by using the CREATE DATABASE or CREATE SCHEMA statement: • When Hive creates a new database, the schema definition data is stored in the Hive metastore. • Hive will raise an error if the database already exists in the metastore; we can check for the existence of the database by using IF NOT EXISTS: • HQL: CREATE DATABASE IF NOT EXISTS flight_data; 205
  206. 206. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Creating a database • We can then run SHOW DATABASES to verify that our database has been created. Hive will return all databases found in the metastore, along with the default Hive database: • HQL: SHOW DATABASES; 206
  207. 207. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Creating tables • Hive provides a SQL-like CREATE TABLE statement, which in its simplest form takes a table name and column definitions: • HQL:

CREATE TABLE airlines (code INT, description STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

• However, because Hive data is stored in the file system (usually HDFS or the local file system), the CREATE TABLE command also takes optional clauses, such as ROW FORMAT, that tell Hive how to read each row in the file and map its fields to our columns. 207
  208. 208. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Loading data • It’s important to note one important distinction between Hive and traditional RDBMSs with regard to schema enforcement: • Traditional relational databases enforce the schema on writes, by rejecting any data that does not conform to the schema as defined; • Hive enforces the schema only on reads (schema-on-read). If, in reading the data file, the file structure does not match the defined schema, Hive will generally return null values for missing fields or type mismatches; for example, a row with fewer fields than the table has columns will yield NULLs for the trailing columns. 208
  209. 209. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Loading data • Data loading in Hive is done in batch-oriented fashion using a bulk LOAD DATA command or by inserting results from another query with the INSERT command. • LOAD DATA is Hive’s bulk loading command. INPATH takes an argument to a path on the default file system (in this case, HDFS). • We can also specify a path on the local file system by using LOCAL INPATH instead. Hive proceeds to move the file into the warehouse location. • If the OVERWRITE keyword is used, then any existing data in the target table will be deleted and replaced by the data file input; otherwise, the new data is added to the table. 209
  210. 210. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Loading data • Examples • HQL:

LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/flight_data/ontime_flights.tsv' OVERWRITE INTO TABLE flights;
LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/flight_data/airlines.tsv' OVERWRITE INTO TABLE airlines;
LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/flight_data/carriers.tsv' OVERWRITE INTO TABLE carriers;
LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/flight_data/cancellation_reasons.tsv' OVERWRITE INTO TABLE cancellation_reasons; 210
  211. 211. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Data Analysis with Hive • Grouping • HQL: SELECT airline_code, COUNT(1) AS num_flights 
 FROM flights 
 GROUP BY airline_code 
 ORDER BY num_flights DESC; 211
  212. 212. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Data Analysis with Hive • Aggregations • HQL: 
SELECT airline_code,
  COUNT(1) AS num_flights,
  SUM(IF(depart_delay > 0, 1, 0)) AS num_depart_delays,
  SUM(IF(arrive_delay > 0, 1, 0)) AS num_arrive_delays,
  SUM(IF(is_cancelled, 1, 0)) AS num_cancelled
FROM flights
GROUP BY airline_code; 212
  213. 213. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Data Analysis with Hive • Aggregations • HQL: 
SELECT airline_code,
  COUNT(1) AS num_flights,
  SUM(IF(depart_delay > 0, 1, 0)) AS num_depart_delays,
  ROUND(SUM(IF(depart_delay > 0, 1, 0))/COUNT(1), 2) AS depart_delay_rate,
  SUM(IF(arrive_delay > 0, 1, 0)) AS num_arrive_delays,
  ROUND(SUM(IF(arrive_delay > 0, 1, 0))/COUNT(1), 2) AS arrive_delay_rate,
  SUM(IF(is_cancelled, 1, 0)) AS num_cancelled,
  ROUND(SUM(IF(is_cancelled, 1, 0))/COUNT(1), 2) AS cancellation_rate
FROM flights
GROUP BY airline_code
ORDER BY cancellation_rate DESC, arrive_delay_rate DESC, depart_delay_rate DESC; 213
  214. 214. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Introduction to HBase • While Hive provides a familiar data manipulation paradigm within Hadoop, it doesn’t change the storage and processing paradigm, which still utilizes HDFS and MapReduce in a batch-oriented fashion. • Thus, for use cases that require random, real-time read/write access to data, we need to look outside of standard MapReduce and Hive for our data persistence and processing layer. • Real-time applications often need to record high volumes of time-based events that tend to have many possible structural variations. • The data may be keyed on a certain value, like User, but the value is often represented as a collection of arbitrary metadata. 214
  215. 215. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Introduction to HBase • For example, two events, “Like” and “Share”, may require different column values, as shown in the table. • In a relational model, rows are sparse but columns are not. That is, upon inserting a new row into a table, the database allocates storage for every column regardless of whether a value exists for that field or not. • However, in applications where data is represented as a collection of arbitrary fields or sparse columns, each row may use only a subset of the available columns, which can make a standard relational schema both a wasteful and awkward fit. 215
  216. 216. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Column-Oriented Databases • NoSQL is a broad term that generally refers to non-relational databases and encompasses a wide collection of data storage models, including • graph databases • document databases • key/value data stores • column-family databases. • HBase is classified as a column-family or column-oriented database, modeled on Google’s Bigtable architecture. 216
  217. 217. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Column-Oriented Databases • HBase organizes data into tables that contain rows. Within a table, rows are identified by their unique row keys, which do not have a data type. • Row keys are similar to primary keys in relational databases, in that they are automatically indexed. 217
  218. 218. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Column-Oriented Databases • In HBase, table rows are sorted by their row key, and because row keys are byte arrays, almost anything can serve as a row key, from strings to binary representations of longs or even serialized data structures. • HBase stores its data as key/value pairs, where all table lookups are performed via the table’s row key, the unique identifier for the stored record data. • Data within a row is grouped into column families, which consist of related columns. 218
  219. 219. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Column-Oriented Databases • Census data as an HBase schema 219
  220. 220. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Column-Oriented Databases • Storing data in columns rather than rows has particular benefits for data warehouses and analytical databases, where aggregates are computed over large sets of data with potentially sparse values, and not all column values are present. • Another interesting feature of HBase and Bigtable-based column-oriented databases is that the table cells, or the intersections of row and column coordinates, are versioned by timestamp. • HBase is thus also described as being a multidimensional map where time provides the third dimension 220
  221. 221. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Column-Oriented Databases • The time dimension is indexed in decreasing order, so that when reading from an HBase store, the most recent values are found first. • The contents of a cell can be referenced by a {rowkey, column, timestamp} tuple, or we can scan for a range of cell values by time range. 221
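A toy Python model of this multidimensional view (purely illustrative; this is not an HBase API, and the timestamps and values are made up):

# An HBase table as a nested map: {row key: {column: {timestamp: value}}}.
table = {
    'org.hbase.www': {
        'link:title': {
            1498867200: 'Apache HBase',         # older version
            1498953600: 'Apache HBase Project', # newer version
        },
    },
}

def get_latest(table, row_key, column):
    # Reads return the most recent version first, so take the
    # value stored under the highest timestamp.
    versions = table[row_key][column]
    return versions[max(versions)]

print get_latest(table, 'org.hbase.www', 'link:title')  # Apache HBase Project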
  222. 222. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Real-Time Analytics with HBase • For the purposes of this HBase overview, we use the HBase shell to design a schema for a linkshare tracker that records the number of times a link has been shared. • Generating a schema • When designing schemas in HBase, it’s important to think in terms of the column-family structure of the data model and how it affects data access patterns. • Furthermore, because HBase doesn’t support joins and provides only a single indexed rowkey, we must be careful to ensure that the schema can fully support all use cases. 222
  223. 223. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Real-Time Analytics with HBase • First, we need to declare the table name and at least one column-family name at the time of table definition. • If no namespace is declared, HBase will use the default namespace • We just created a single table called linkshare in the default namespace with one column-family, named link • To alter the table after creation, such as changing or adding column families, we need to first disable the table so that clients will not be able to access the table during the alter operation: 223

hbase> create 'linkshare', 'link'
  224. 224. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Real-Time Analytics with HBase • Good row key design affects not only how we query the table, but also the performance and complexity of data access. • By default, HBase stores rows in sorted order by row key, so that similar keys are stored on the same RegionServer. • Thus, in addition to enabling our data access use cases, we also need to be mindful to account for row key distribution across regions. • For the current example, let’s assume that we will use the unique reversed link URL for the row key. 224

hbase> disable 'linkshare'
hbase> alter 'linkshare', 'statistics'
hbase> enable 'linkshare'
  225. 225. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Real-Time Analytics with HBase • In our linkshare application, we want to store descriptive data about the link, such as its title, while maintaining a frequency counter that tracks the number of times the link has been shared. • We can insert, or put, a value in a cell at the specified table/row/column and, optionally, timestamp coordinates. • To put a cell value into table linkshare at the row with row key org.hbase.www, under column-family link and column title, marked with the current timestamp: 225

hbase> put 'linkshare', 'org.hbase.www', 'link:title', 'Apache HBase'
hbase> put 'linkshare', 'org.hadoop.www', 'link:title', 'Apache Hadoop'
hbase> put 'linkshare', 'com.oreilly.www', 'link:title', "O'Reilly.com"
  226. 226. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Real-Time Analytics with HBase • The put operation works great for inserting a value for a single cell, but for incrementing frequency counters, HBase provides a special mechanism to treat columns as counters. • To increment a counter, we use the command incr instead of put. • The last option passed is the increment value, which in this case is 1. • Incrementing a counter will return the updated counter value, but you can also access a counter’s current value at any time using the get_counter command, specifying the table name, row key, and column: 226

hbase> incr 'linkshare', 'org.hbase.www', 'statistics:share', 1
hbase> incr 'linkshare', 'org.hbase.www', 'statistics:like', 1
  227. 227. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Real-Time Analytics with HBase • HBase provides two general methods to retrieve data from a table: • the get command performs lookups by row key to retrieve attributes for a specific row, • and the scan command, which takes a set of filter specifications and iterates over multiple rows based on the indicated specifications. • In its simplest form, the get command accepts the table name followed by the row key, and returns the most recent version timestamp and cell value for columns in the row. 227

hbase> incr 'linkshare', 'org.hbase.www', 'statistics:share', 1
hbase> get_counter 'linkshare', 'org.hbase.www', 'statistics:share'
hbase> get 'linkshare', 'org.hbase.www'
  228. 228. Eakasit Pacharawongsakda, Ph.D. Big Data Engineering Program, CITE, DPU Real-Time Analytics with HBase • The get command also accepts an optional dictionary of parameters to specify the column(s), timestamp, timerange, and version of the cell values we want to retrieve. For example, we can specify the column(s) of interest: • A scan operation is akin to database cursors or iterators, and takes advantage of the underlying sequentially sorted storage mechanism, iterating through row data to match against the scanner specifications. • With scan, we can scan an entire HBase table or specify a range of rows to scan. 228

hbase> get 'linkshare', 'org.hbase.www', 'link:title'
hbase> get 'linkshare', 'org.hbase.www', 'link:title', 'statistics:share'
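The same linkshare operations can also be scripted from Python. Below is a minimal sketch using the third-party HappyBase library (an assumption, not covered in these slides), with the HBase Thrift server assumed to be running on localhost:

import happybase

connection = happybase.Connection('localhost')  # Thrift server assumed on the default port
table = connection.table('linkshare')

# put: insert a cell value at (row key, column)
table.put('org.hbase.www', {'link:title': 'Apache HBase'})

# incr / get_counter equivalents for the statistics column family
table.counter_inc('org.hbase.www', 'statistics:share')
print table.counter_get('org.hbase.www', 'statistics:share')

# get: look up a single row by row key
print table.row('org.hbase.www', columns=['link:title'])

# scan: iterate over rows sharing a row key prefix
for key, data in table.scan(row_prefix='org.hbase'):
    print key, data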