Big data



  1. Introduction to Big Data. Byung-Won On, PhD, Seoul National University, December 18, 2012
  2. Outline: Big Data; MapReduce (Word Count, k-means Clustering Algorithm); NoSQL (Neptune); Demo (Hadoop Installation, Word Count using MapReduce)
  3. Big data in my opinion
  4. Main keywords related to Big Data: data in TB or PB; Volume, Velocity, Variety, Value, Complexity. Source: Gartner, Nov. 2012
  5. Big Data Platform. Source: Chang-Won Ahn and Seung-Gu Hwang, Big Data Technology and Key Issues, Communications of KIISE, Vol. 30, No. 6, pp. 10-17, 2012
  6. Big Data Technology: Distributed File Systems (GFS, HDFS); Databases: Oracle, DB2, MySQL (RDBMS) and Bigtable, HBase, Cassandra, MongoDB (NoSQL); Parallel Programming Models (MapReduce, Hive, Pig); Analytics & Visualization (Mahout, R, Tableau, Nutch)
  7. HADOOP & MAPREDUCE
  8. Motivating Example: 20 billion web pages × 20KB per page ≈ 400TB. A computer reads 30-35MB/sec from disk, so reading it all takes about 4 months.
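The slide's arithmetic can be verified with a quick back-of-the-envelope check, assuming decimal units (1 KB = 10^3 bytes) and a 30-day month:

```python
# Sanity check of the numbers on slide 8 (decimal units assumed).
pages = 20_000_000_000          # 20 billion web pages
bytes_per_page = 20 * 10**3     # 20 KB each
total_bytes = pages * bytes_per_page
print(total_bytes / 10**12)     # 400.0 (TB)

read_rate = 35 * 10**6          # 35 MB/sec sequential read
seconds = total_bytes / read_rate
months = seconds / (30 * 24 * 3600)
print(round(months, 1))         # roughly 4.4 months on one disk
```

This single-disk bottleneck is exactly what motivates spreading both the data and the reads across a cluster.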
  9. Cluster Architecture. Source: http://cs246.stanford.edu
  10. Challenges: How do we distribute computation? How can we make it easy to write parallel programs? How do we handle machine failures?
  11. Approach: Distributed File Systems (Google File System, Hadoop Distributed File System); Parallel Programming Framework (MapReduce in Hadoop)
  12. Distributed File System. Problem: if nodes fail, how do we store data persistently? Solution: the Hadoop Distributed File System (HDFS), which provides a global file namespace. Properties of data suited to HDFS: huge files (~xx TB); data that is rarely updated in place (i.e., immutable files); frequent read/append operations.
  13. Distributed File System. Name Node: stores metadata; runs active/standby. Data Nodes: each file is split into contiguous chunks of ~64MB; each chunk is replicated ~3x, with replicas kept in different racks. Client (to access files): contacts the name node to locate the data nodes, then connects directly to the data nodes to access the data.
  14. MapReduce Programming Architecture
  15. Overview: sequentially read big data; Map: extract something you care about; group by key (sort and shuffle); Reduce: aggregate, summarize, filter, transform; write the result
  16. Map Step. Source: http://cs246.stanford.edu
  17. Reduce Step. Source: http://cs246.stanford.edu
  18. Algorithm. Input: a set of key/value pairs. A programmer specifies two methods: Map(k, v) => <k', v'>*, which takes a key/value pair and outputs a set of key/value pairs (e.g., the key is the filename and the value is a single line in the file); and Reduce(k', <v'>*) => <k', v''>*, in which all values v' with the same key k' are reduced together and processed in v' order.
  19. Word Counting using MapReduce. Source: http://cs246.stanford.edu
  20. Word Counting using MapReduce
      Map(key, value)      // key: document name; value: text of the document
        for each word w in value:
          emit(w, 1)
      Reduce(key, values)  // key: a word; values: an iterator over counts
        result = 0
        for each count v in values:
          result += v
        emit(key, result)
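The pseudocode above can be run as a minimal single-machine simulation; the `run_mapreduce` driver below is a hypothetical stand-in for the framework's shuffle phase, not Hadoop's actual API:

```python
from collections import defaultdict

# Single-machine sketch of the word-count job on slide 20:
# map emits (word, 1), the driver groups by key, reduce sums per word.

def map_fn(doc_name, text):
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def run_mapreduce(documents, map_fn, reduce_fn):
    # Shuffle phase: group all intermediate values by key.
    groups = defaultdict(list)
    for name, text in documents.items():
        for k, v in map_fn(name, text):
            groups[k].append(v)
    result = {}
    for k, vs in sorted(groups.items()):
        for out_k, out_v in reduce_fn(k, vs):
            result[out_k] = out_v
    return result

docs = {"d1": "the quick brown fox", "d2": "the lazy dog the end"}
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'brown': 1, 'dog': 1, 'end': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 3}
```

In real Hadoop the grouping and sorting happen across machines; only `map_fn` and `reduce_fn` are user code.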
  21. MapReduce Tasks: partition the input data; schedule program execution across nodes; handle machine failures; manage inter-node communication
  22. Parallel Processing
  23. Name Node. Task status: idle, in-progress, completed. Idle tasks get scheduled as workers become available. When a map task finishes, it sends the name node the locations and sizes of its intermediate files, one per reduce worker; the name node pushes this information to the reducers. The name node regularly pings workers to detect failures.
  24. Failover. Map worker failure: map tasks completed or in progress at the failed worker are reset to idle; reduce workers are notified when a task is rescheduled on another worker. Reduce worker failure: only in-progress tasks are reset to idle. Master failure: the MapReduce job is aborted and the client is notified.
  25. Set-up. Map tasks: M; reduce tasks: R. M and R are chosen much larger than the number of nodes in the cluster; one chunk of data per map task is common. This improves dynamic load balancing and speeds recovery from worker failures. Often R is smaller than M (note that the output is spread across R files).
  26. Combiners. A map task will produce many pairs (k, v1), (k, v2), … for the same key k (e.g., popular words in word count). Pre-aggregate values at the mapper: combine(k, list(v1)) -> v2. For word count, the combiner is the same as the reduce function.
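The combiner idea can be sketched as follows (the `map_with_combiner` helper is illustrative, not part of any Hadoop API): local pre-aggregation shrinks the number of intermediate pairs sent over the network during the shuffle.

```python
from collections import Counter

# Combiner sketch from slide 26: pre-aggregate (word, 1) pairs locally
# on each mapper before the shuffle. For word count the combiner
# performs the same operation as the reducer (summing counts).

def map_with_combiner(text):
    raw_pairs = [(w, 1) for w in text.split()]
    combined = Counter()
    for word, count in raw_pairs:
        combined[word] += count        # local, per-mapper aggregation
    return raw_pairs, list(combined.items())

raw, combined = map_with_combiner("to be or not to be")
print(len(raw), len(combined))  # 6 intermediate pairs shrink to 4
```

The saving grows with key skew: the more popular a word, the more pairs collapse into one.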
  27. Map Step
  28. Reduce Step
  29. Reduce Step
  30. K-MEANS USING MAPREDUCE
  31. Mixed Entities in the Web. A search result includes a mixture of web pages about different people named Tom Mitchell; separate those web pages into different groups (called clusters). Byung-Won On, Ingyu Lee and Dongwon Lee, Scalable clustering methods for the name disambiguation problem, Knowledge and Information Systems 31(1):129-151, 2012
  32. Clustering. Web pages of two different persons with the same name spelling are all mixed in the pool. (Figure: mixed pages a1-a3 and b1-b2 are separated into two clusters.)
  33. k-means
      1. Randomly select the cluster centroids.
      2. Measure the distance between each centroid and each object.
      3. Assign each object to the nearest centroid.
      4. Choose a new centroid in each cluster as the mean of that cluster.
      5. Repeat steps 2-4 until the convergence criterion is met.
      (Figure: points w1-w5 are reassigned over successive iterations.)
  34. k-means using MapReduce. Do: Map: the input is a data point, and the k centers are broadcast; find the closest of the k centers for the input point. Reduce: the input is one of the k centers and all data points having that center as their closest; calculate the new center from those data points. Until none of the new centers change.
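One iteration of the loop above can be sketched as map plus reduce; this is a minimal numeric sketch assuming 1-D points and squared Euclidean distance, with function names of my own choosing rather than any framework's API:

```python
from collections import defaultdict

# One k-means iteration phrased as MapReduce (slide 34):
# map assigns each point to its nearest broadcast center,
# reduce averages the points assigned to each center.

def map_fn(point, centers):
    nearest = min(range(len(centers)),
                  key=lambda i: (point - centers[i]) ** 2)
    return nearest, point                 # (center index, data point)

def reduce_fn(points):
    return sum(points) / len(points)      # new center = mean of members

def kmeans_iteration(points, centers):
    groups = defaultdict(list)
    for p in points:                      # map + shuffle
        idx, p = map_fn(p, centers)
        groups[idx].append(p)
    return [reduce_fn(groups[i]) if groups[i] else centers[i]
            for i in range(len(centers))] # reduce

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(kmeans_iteration(points, centers=[0.0, 9.0]))  # [2.0, 11.0]
```

The driver re-runs this map/reduce pair, re-broadcasting the new centers, until they stop changing.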
  35. Map Step
  36. Reduce Step
  37. NOSQL
  38. Relational DB: manages data in GB or TB; stores important data (transactions, personnel, …); guarantees both consistency and availability. Examples: Oracle, DB2, MS SQL Server, MySQL
  39. Not Only SQL (NoSQL): manages unstructured data such as text in TB or PB; guarantees partition tolerance; guarantees either consistency or availability; flexible schema; no SQL or join operations. Examples: Bigtable (Google), Dynamo (Amazon), HBase (Yahoo), Cassandra (Facebook), MongoDB, Neptune (NHN); Bigtable ≈ HBase ≈ Neptune
  40. Neptune: Managing Big Data. Analyzing log data from Internet portals or online game services; calculating PageRank or similarities between web pages; search personalization; social network analysis, recommender systems, blog clustering, etc.
  41. System Architecture
  42. Component Nodes. Master Node: assigns tablets to TabletServers, considering the number and size of tablets. TabletServers: handle insertions/deletions from clients; store a few thousand tablets (100-200MB per tablet), i.e., a few hundred GB each; in-memory & disk DB; merge tablets when the number of files grows (improves search); split tablets when a file grows too large (improves performance). Changelog Servers: store the transaction log.
  43. Data Format. Logical data unit: table. Row: a row key created automatically by the system; rows are sorted in ascending order. Column: column key and timestamp; columns are sorted in lexical order; a get operation returns the most recent data; column-oriented indexing. A table is divided into a set of tablets by row key, and the tablets are stored across the cluster.
  44. Meta Data. Metadata is stored in the shared memory of Pleiades.
  45. Real-Time Processing. Reasonable performance: response times of a few ms; in-memory DB. Minor compaction: triggered when a memory table is full. Major compaction: combines multiple tables for fast search operations. Garbage collection.
  46. MapReduce
  47. Client API & Shell Commands
  48. Failover. The active master sets the NEPTUNE_MASTER lock in Pleiades; if the active master fails, Pleiades releases the NEPTUNE_MASTER lock and the slave master acquires it.
  49. Concluding Remarks. Big Data: Volume, Velocity, Variety. Storing/managing Big Data: Hadoop and NoSQL on a cluster. Parallel programming: MapReduce. Analytics (mining & visualization): Mahout, R
  50. Future Plan: Infrastructure. A pilot system for Big Data (Dec. 2012): 1 management server; 1 name node (2 CPUs × 6 cores, 24GB RAM, 1TB HDD & SSD); 5 data nodes (2 CPUs × 6 cores, 24GB RAM, 1TB HDD & SSD); rack mount; gigabit switch hub; Hadoop & CDH
  51. Future Plan: Research. Developing machine learning, modeling, and optimization algorithms for mining and visualizing public data at TB scale. Re-designing existing data mining algorithms using MapReduce: clustering, classification, probabilistic modeling, association rule mining, graph analysis, etc.; serial algorithms => parallel algorithms
  52. References: G. Shim, MapReduce Algorithms for Big Data Analysis, VLDB 2012 Tutorial; J. Schindler, I/O Characteristics of NoSQL Databases, VLDB 2012 Tutorial; J. Leskovec, Mining Massive Datasets, available at http://cs246.stanford.edu; Hyung-Jun Kim, Neptune: A Large-Scale Distributed Data Management System, NHN Tech. Report, 2008; T. White, Hadoop: The Definitive Guide, O'Reilly, 2012
  53. DEMONSTRATION
  54. Outline: Hadoop Installation; Word Counting using MapReduce
  55. Software for Hadoop Installation: VirtualBox 4.1.22 (https://www.virtualbox.org/); CentOS (http://www.centos.org/); JDK (http://www.oracle.com/technetwork/java/index.html); Hadoop 1.0.4 (http://apache.tt.co.kr/hadoop/common/hadoop-1.0.4/hadoop-1.0.4.tar.gz)
  56. Hadoop Installation: JDK => Hadoop => Configuration. tar xvf hadoop-1.0.4-bin.tar.gz
  57. Three Modes for Hadoop Installation: standalone, pseudo-distributed, fully distributed
  58. Configuration (hadoop-1.0.4/conf): hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml
  59. Pseudo-Distributed Mode: core-site.xml, mapred-site.xml, hdfs-site.xml
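The slide shows only screenshots of these files; a typical Hadoop 1.x pseudo-distributed configuration uses the standard properties below. The localhost port numbers are the conventional defaults, not taken from the slides.

```xml
<!-- core-site.xml: point the default filesystem at a local HDFS instance -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single node cannot hold 3 replicas, so use 1 -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: run the JobTracker locally -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```

Each snippet goes in its own file under hadoop-1.0.4/conf; with these in place, all daemons run on one machine while still exercising the full HDFS and MapReduce code paths.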
  60. Completion of Hadoop Installation
  61. Word Count using MapReduce
  62. Word Count using MapReduce
  63. Word Count using MapReduce
  64. Word Count using MapReduce
  65. Text Data: input file, output file
  66. Patent ID Data: input file, output file
