The document provides an agenda for a Hadoop/Big Data introductory session. The agenda covers introductions to big data concepts and Hadoop components like HDFS, MapReduce, Hive, HBase and Sqoop. It discusses working with HDFS and MapReduce, including file reads/writes in HDFS and MapReduce architecture, jobs and execution. Hands-on demos and code samples are proposed to supplement the theoretical content. The goal is to develop an understanding of big data theory and practice.
Training
1. Hadoop Big Data Intro
2/16/2013
Hadoop/BigData Intro
Provided agenda
Addition: Theory from papers
Addition: Demo/code samples
Addition: System architecture
Goal: develop some theory
2. Agenda
● Introduction to Big Data
● Basic Concepts
● Hadoop
– Overview of Hadoop
– Working with HDFS / Map Reduce
– Architecture
– Anatomy of File write / read
– Admin and Development
– Introduce other components of the Hadoop ecosystem
3. Agenda (2)
● Hive / HBase / Pig / Sqoop
● Map Reduce
– Features
– Architecture
– Working
– Job Execution
We can cover this circa-2005 agenda in 3 hours with some additions. A hands-on lab is needed to understand the content.
4. Big Data defn.
● Big data: too big to run SQL queries on
● Lots of data (cover the Google approach, which is what Hadoop is based on)
Replacing Legacy Systems
10x
Building Applications on Hadoop, Competitive Gap
Astyanax
DevOps, Packaging, Chaos Monkey, AWS, Zookeeper
Modifying the Hadoop Components, JIRA
3-4x
5. Big Data Basic Concepts
● Storing large amounts of data and doing something with them
– Some sort of analytics
● Easy: Tableau, Datameer
● Competitive Advantage
– Small-scale analytics: R, stats 202, Demographics, Weblog
– Large-scale analytics: cs246
● Should be able to define analytics POCs based on the next slide, which are domain specific
7. Big Data started in 2000, 2 design problems @Google, 1998-2000
There is a separate Big Data product for each use case.
● Google Design Problems/GFS:
– Store internet pages on hard drives
– Unstructured data
● Collect HTML and links; images?
● 20+ billion web pages x 20 KB = 400+ TB
● 1 computer reads 30-35 MB/sec from disk
● ~4 months to read the web
● ~1,000 hard drives to store the web
Source: Jure Leskovec's slides, cs246
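The numbers on this slide can be checked with a few lines of arithmetic. This is a minimal sketch using the slide's figures; the 400 GB drive capacity is an assumption (not on the slide), chosen because it is what makes ~1,000 drives hold 400 TB.

```python
# Back-of-the-envelope check of the slide's figures (decimal units throughout).
PAGES = 20e9             # 20+ billion web pages (from the slide)
PAGE_SIZE_KB = 20        # average page size in KB (from the slide)
READ_MB_PER_SEC = 35     # upper end of the 30-35 MB/sec disk read rate (from the slide)
DRIVE_SIZE_GB = 400      # ASSUMPTION: capacity of one drive, not stated on the slide

total_tb = PAGES * PAGE_SIZE_KB / 1e9            # KB -> TB
seconds = total_tb * 1e6 / READ_MB_PER_SEC       # TB -> MB, then MB / (MB/sec)
days = seconds / 86_400                          # seconds per day
drives = total_tb * 1e3 / DRIVE_SIZE_GB          # TB -> GB, then GB / (GB per drive)

print(f"{total_tb:.0f} TB, {days:.0f} days on one machine, {drives:.0f} drives")
```

At 35 MB/sec a single machine needs roughly 132 days, which is where the "~4 months" comes from.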
8. Google M/R
● Once the data is on 1k machines...
– How to run an algorithm over 1k disk drives?
– Traditional method: read the file into memory. Can't put web pages into memory, and reading the data would saturate the network.
● Soln: Map Reduce. Move the code to the data via mappers and reducers, which are placed on the same computer as the data
● GFS paper / MapReduce paper. Hadoop = GFS + M/R
9. Google GFS
● The html/links/images were stored in BigTable. Store html pages in files, many pages per file. Why? Seeks are expensive ($); store the crawl
● 2 parts: What is a file system? SB (superblock) = collection of inodes
10. R/W in a file system
● Read the contents of foo.txt
– Go to the superblock, find the location of the datablocks from the pointer in the superblock for foo.txt, and read them into memory
● Write into foo.txt
– Go to the superblock, write the contents into new datablocks, and append the addresses of the datablocks to the superblock entry for foo.txt.
11. Distribute file system across servers
● Superblocks => GFS master => Hadoop NameNode
● inodes => chunkservers => Hadoop DataNode
12. R/W in distributed file system
● Read from HDFS foo.txt:
– Go to the namenode, find the datablock where the data is, read the data into memory on the client machine. What is the difference?
● Write into HDFS foo.txt:
– Go to the namenode, find an empty block, tell the client to send data to an empty block on the datanode, append the addresses of the new blocks into the NN for foo.txt. What is the difference?
● Client, Network
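The read and write paths can be sketched as a toy model. This is not the HDFS API: the class names, the round-robin placement and the 4-byte block size are all assumptions made for illustration, and replication is left out. The point is only that the namenode holds metadata while the bytes live on the datanodes.

```python
class DataNode:
    """Toy datanode: stores block contents keyed by block id."""
    def __init__(self):
        self.blocks = {}


class NameNode:
    """Toy namenode: metadata only (file name -> ordered block locations)."""
    def __init__(self, num_datanodes):
        self.datanodes = [DataNode() for _ in range(num_datanodes)]
        self.files = {}
        self.next_block_id = 0

    def allocate_block(self, name):
        # ASSUMPTION: round-robin placement; real placement is more involved.
        block_id = self.next_block_id
        self.next_block_id += 1
        dn_index = block_id % len(self.datanodes)
        self.files.setdefault(name, []).append((dn_index, block_id))
        return dn_index, block_id


def write(nn, name, data, block_size=4):
    """Client write: ask the namenode for a block, then send the bytes to that datanode."""
    for start in range(0, len(data), block_size):
        dn_index, block_id = nn.allocate_block(name)
        nn.datanodes[dn_index].blocks[block_id] = data[start:start + block_size]


def read(nn, name):
    """Client read: get block locations from the namenode, fetch bytes from the datanodes."""
    return b"".join(nn.datanodes[dn].blocks[b] for dn, b in nn.files[name])


nn = NameNode(num_datanodes=3)
write(nn, "foo.txt", b"hello world")
print(read(nn, "foo.txt"), nn.files["foo.txt"])
```

Compared with the single-machine case, every step now involves a client and a network hop, which is the difference the slide asks about.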
14. HDFS Demo
● List of files
● NN+DN website
– http://<name_node_address>:50070/
– Where is the DN? Port: 50075
● Logs demo
● Running in single-node PD (pseudo-distributed) mode
– JVM processes are threads vs. separate JVM processes for each service.
– Global vars in mappers work in PD mode but not in a cluster
● /etc/init.d. Do not download and install the tarball
15. File R/W system issue
● Cache/Disk Drives
● Before writing from memory to disk, the power goes out. Lost data
Write to Memory
Write to Disk
16. Failures
● Commodity servers fail. One server @G may stay up 3 years (1,000 days)
● If you have 1,000 servers, expect to lose 1/day
● With 1M machines, 1,000 machines fail every day!
● Google 3y vs. elsewhere once per 3w? Why? 20 servers?
● GFS paper / restart failed M/R tasks. Not in Hadoop
● Most system designs neglect failure, except Netflix Chaos Monkey
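The failure numbers follow from a one-line estimate. This is a minimal sketch that assumes failures are independent and that each server fails on average once per 1,000 days, as on the slide; the function name is illustrative.

```python
def expected_failures_per_day(num_servers, mean_uptime_days=1000):
    """Expected number of server failures per day, assuming independent failures."""
    return num_servers / mean_uptime_days


for n in (1, 1_000, 1_000_000):
    print(n, "servers ->", expected_failures_per_day(n), "failures/day")
```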
17. What is Hadoop?
● An implementation of GFS/Map Reduce in Java.
– Used at Yahoo, LinkedIn, Facebook, Netflix, Twitter
● What did each contribute? Use cases?
– Doug Cutting (Cloudera)/Lucene
– v1.0 vs v2.0
– Hadoop components: HBase, Flume, Sqoop, Zookeeper, Oozie, Pig, Hive
18. HDFS
● HDFS (Hadoop Distributed File System) is a distributed file system.
– Unlimited capacity: to add more capacity, add more nodes
– A file's SB info is stored in a NN server. Inodes or datablocks are stored in DN servers.
– Replicate for data locality & error detection/recovery
– Replicate a data block 3x. Why?
● HDFS:
– Append-only file system (copied from the Google paper)
19. HDFS
● What is the file system on your laptop?
– Append only or random R/W?
● When is append only bad?
– Digression: RMW (read-modify-write). Editing a Word document is what? Append only or RMW?
– Design exercise: 200Gb in files. How many files are there?
– Does this fit in memory?
20. HDFS Design exercise
● Many files are combined into a smaller number of large files. How to access the smaller files?
– Slower to access for reads
– If RMW: add the modifications into new blocks in HDFS. Finding the new blocks and reading them into memory is slower than sequential access on a single-node file system
– Faster to delete the old file and create a new file with sequential blocks in place.
21. Solns
1) On write, write to disk every time you write to memory
– Why Good?
– Why Bad?
2) Lose the data when the power goes out
– Why Good?
– Why Bad?
● FSCK: file system consistency check
22. Agenda: Admin and Development
● HDFS/MR administration. HBase, etc. are different
– 24x7 SLA
● Hot standbys for maintenance
● HDFS: recovery from user error, restore the file I just deleted
● HDFS/MR: recovery from failures (not automated in Hadoop)
● MR: lagging mapper, cascading failures
24. HDFS Schemas
● Do you store 20B files on HDFS by file name?
– What happens with multiple files with the same name, e.g. test.txt?
– Create metadata, partitions
● HDFS Schemas:
– Avro
– Parquet
● Dremel column store/encoding
25. Map Reduce Intro(1)
● Map Reduce
– Designed in 2000, when there was very little memory in commodity PCs, ~4GB or less. These aren't enterprise-class servers.
– This isn't the case today. Multi-CPU/multi-core 192 GB machines are much more reliable, with different use cases
– The M/R idiom is being replaced with non-MR systems.
– What we don't cover
● Google F1
26. Map Reduce Intro
● There are 3 parts to how Map Reduce works:
– Mapper
– Shuffle
– Reducer
● There are 3 parts to a Map Reduce program:
– Mappers
– Reducers
– Driver
● These 2 concepts aren't the same. People get these mixed up.
27. Map Reduce Part 1
● 1k node cluster; bring the code to the data. Reduce network traffic
● Programming idiom
● Divide the task into mappers.
● Examples of what can be divided and combined
– Try dividing first; assume you can combine anything you can divide
– Divide the input file into single lines, send one line to each server, process each line
28. Word Count
● I can count the words in a text file with a single program.
● I can split the file across mappers and have the mappers count the words in parallel
Diagram: FileLine -> Mapper (x4, one mapper per file line)
29. Word Count
● The mappers output K/V pairs onto the network. These are not Java Strings or Java objects!
– Keys: Comparable, Writable
– Values: Writable
● Network saturates with multiple M/R jobs.
Diagram: Network -> Reducer (x4)
30. Shuffle/Reduce Part 2/3
● The K/V pairs are sent over the network to certain destinations based on 2 rules:
– 1) all values for a given K go to the same reducer
– 2) all keys are in sorted order
● Output in 2 forms: _SUCCESS and part-00000
● Custom partitioner to send K to a specific Reducer
● Grouping Comparator: group keys to a reducer
● Sorting Comparator: can modify the sort order for compound keys
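The map, shuffle and reduce steps for word count can be simulated on one machine. This is a minimal sketch in plain Python, not Hadoop code: the function names are illustrative, and the byte-sum partitioner is an assumed stand-in for a hash partitioner, chosen so the output is deterministic.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word, 1


def partition(key, num_reducers):
    """ASSUMPTION: simple deterministic partitioner; the same key always maps to the same reducer."""
    return sum(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Group values by key per reducer, and sort the keys within each reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]


def reducer(key, values):
    """Sum the counts for one word."""
    return key, sum(values)


def word_count(lines, num_reducers=2):
    pairs = [kv for line in lines for kv in mapper(line)]
    partitions = shuffle(pairs, num_reducers)
    # One output list per reducer, like one part-0000N file per reducer.
    return [[reducer(k, vs) for k, vs in part] for part in partitions]


print(word_count(["a b a", "b c"]))
```

Each inner list plays the role of one reducer's output file; a key never appears in more than one of them.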
37. Protocol Buffers
● Used internally at Google, compact serialization
● https://code.google.com/p/protobuf/
● “proto bufs”: not just serialization, closest to binary. Used internally in Hadoop.
38. Why do we need Avro, Protobufs?
Binary: no parser, fast, small. OK for objects; maybe this is like Hibernate
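The "small" part of the claim can be illustrated by encoding one record as text and as fixed-width binary. This is a minimal sketch using Python's `struct` module; it is not the Protobuf or Avro wire format, and the record fields are made up for illustration.

```python
import json
import struct

# Hypothetical record, for illustration only.
record = {"user_id": 123456, "score": 98.5, "active": True}

# Text encoding: field names and punctuation travel with every record.
as_json = json.dumps(record).encode("utf-8")

# Binary encoding: little-endian uint32 + float64 + bool, no field names, no parsing step.
LAYOUT = "<Id?"
as_binary = struct.pack(LAYOUT, record["user_id"], record["score"], record["active"])
decoded = struct.unpack(LAYOUT, as_binary)

print(len(as_json), "bytes as JSON vs", len(as_binary), "bytes as binary")
```

The trade-off is that reader and writer must agree on the layout ahead of time, which is the job a schema does in Avro and Protobufs.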
39. Thrift
● Add a server to send/receive objects and do the serialization/deserialization
40. Map Reduce References
● What can I do with each text line?
– Easy: ETL patterns:
● Match patterns
● Count the number of occurrences of tokens
● Processing files
– Harder: Machine Learning/DM. What can't be easily done?
● K-means clustering
● Ullman book, Mining of Massive Datasets: http://infolab.stanford.edu/~ullman/mmds.html
● Jimmy Lin book: http://www.umiacs.umd.edu/~jimmylin/
41. MRv2
● 2 versions of M/R
– v1: old API, import xxx.mapred, JT/TT
– v2: new API, import xxx.mapreduce, RM/NM/JH
● YARN, in Hadoop 2.x, maintains backward compatibility with M/R v1.
– Devs start shifting to Hadoop 2.x YARN for new bug fixes
43. YARN->Enterprise
● Encrypted/Pluggable Shuffle/Sort
● Httpfs rewrite or proxyserver
● V2 user authentication/permissions.
– Apache Sentry
● Separate authorization policies per database/schema
● Users have to customize for shared data structures (tables/metadata, HBase, search, ZK). Not in any distro!
● Schema metadata needs fine-grained auth.
● Web app proxy/part of RM to reduce attacks on the exposed RM web server
44. Map Reduce Demo
● Word Count demo
– HDFS NameNode: http://localhost:50070/
– HDFS DataNode: http://localhost:50075/
– ResourceManager: http://localhost:8088
– JobHistory Server: http://jhs_host:19888
● Logging mistakes
– Logging added to M/R jobs grows in proportion to the data size and the number of times the program runs. A 1TB file means 1TB of logs. Processing 100GB 10x
– Logs fill up the disk and crash the system
– Zookeeper logs
45. M/R Pipelines
● The successful organizations never write Mappers/Reducers directly. They use higher-level tools like Pig, Hive, etc.
● Defn:
– Workflow: series of M/R jobs
– Pipeline: output of one M/R job is the input to another
● Apache Crunch, modeled after Google FlumeJava
46. Google FlumeJava
● Introduction of data pipelines based on multiple M/R stages
● Define a parallel collection with a set of parallel operations
● Much easier to use than M/R programming. Contrast w/UDFs. Fewer lines of source:
47. Apache Crunch
● Not just M/R
– Faster to specify w/API a data processing pipeline you can customize instead of writing Pig/Hive scripts, MRPipelines
– YARN, next version of M/R
– Supports Apache Spark, SparkPipelines
– Can keep in memory vs. spill to disk, MemPipelines
48. Case Study of old systems
● Older generation of Hadoop components: Hadoop, Pig, Hive.
● Gives insight into the stability/capability of products
53. Yahoo
● Targeting content, not search
● 3k Pig jobs in production
● Hive in small use for analysts, Pig in heavy production use. Non-MR in use now. Matches Google's progression
54. Mapper Failures
● What happens? Google's paper restarts failed tasks. NS
● Hadoop doesn't recover automatically
● Hadoop Mapper/Reducer Worker Failure:
– Reschedule on another worker
– Speculative Execution
– Completed ok, in progress reset
– (ADD FROM VIDEO)
● Master failure: abort and return fail to client
55. M/R Runtime
● Balancing cluster capacity
– #m >> number of nodes
– #r << #m
– One HDFS chunk per mapper. Careful w/small files. Why? Won't just “run”. Need admin
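Why small files hurt can be seen by counting map tasks. This is a minimal sketch, assuming one mapper per HDFS block, at least one mapper per file, and a 128 MB block size (the block size is configurable and is not given on the slide).

```python
import math

BLOCK_SIZE_MB = 128  # ASSUMPTION: configurable in practice, not stated on the slide


def map_tasks(file_sizes_mb, block_size_mb=BLOCK_SIZE_MB):
    """One mapper per block, and never fewer than one mapper per file."""
    return sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)


one_big_file = [1024]           # a single 1 GB file
many_small_files = [1] * 1024   # the same 1 GB spread over 1,024 files of 1 MB

print(map_tasks(one_big_file), "mappers vs", map_tasks(many_small_files), "mappers")
```

The same gigabyte of input needs 8 mappers as one file and 1,024 mappers as small files, each paying task start-up cost for 1 MB of work.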
56. Bad Design
● Combiners
– Reduce network traffic. Google has special switches for network latency/throughput
– job.setCombinerClass(IntSumReducer.class);
– Combiner can execute 0, 1 or many times. Why?
● Combiner demo:
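A combiner has to be safe to run zero, one or many times, which a sum is. This is a minimal simulation in plain Python, not Hadoop code; the function names are illustrative. It runs the combiner a varying number of times on each mapper's output and checks what reaches the reducer.

```python
from collections import Counter


def map_words(line):
    """Mapper: one (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: local sum per key on the mapper side, before the network."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def reduce_all(pairs):
    """Reducer: final sum per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run(lines, combiner_passes):
    """Return the final counts and the number of pairs sent over the 'network'."""
    shuffled = []
    for line in lines:
        output = map_words(line)
        for _ in range(combiner_passes):
            output = combine(output)
        shuffled.extend(output)
    return reduce_all(shuffled), len(shuffled)


lines = ["a a a b", "a b b"]
for passes in (0, 1, 2):
    print(passes, "combiner passes ->", run(lines, passes))
```

The counts are identical in all three runs; only the number of pairs crossing the network changes (7 without a combiner, 4 with one). An operation such as an average would not survive being applied an arbitrary number of times unless it carried sums and counts separately.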
57. Greedy Scheduling
● Google Borg (not published)
● Mesos/YARN
– Linux cgroups/containers
– Allocate memory/CPU to each task
– IO not implemented; Sync/Async
59. Writing SQL queries in M/R
● Select * from /tmp/sqlqueries
● Select a
● What is the problem with implementing SQL queries in M/R? What do you get w/a db that you don't get with SQL M/R?
60. After GFS, M/R; Google Sawzall
● Contributions:
– High-level procedural language, simpler than SQL, operating on unstructured data
– How to deal with performance problems with sparse data records?
● Protobufs (used in Hadoop). Dense serialization format to reduce network traffic/disk space
– Multiple jobs, multiple users
● Workqueue (Apache Oozie)
61. Apache Pig
● Paper, Chris Olston, Stanford/Yahoo Research
● Related to Google Sawzall
● Contributions:
– PigLatin, like a unix pipe model
– Cat a file, grep and count # of the word 'foo', sed/awk
– all data are tuples
– Write M/R jobs at a higher level than Java Mappers/Reducers
– Write multistage M/R pipelines
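The unix-pipe idea (cat a file, grep for 'foo', count) can be mimicked with chained generators. This is a Python stand-in, not Pig Latin; the function names mirror the shell tools and are illustrative only.

```python
def cat(text):
    """Stage 1: stream the lines of the input."""
    yield from text.splitlines()


def grep(lines, word):
    """Stage 2: keep only the lines that contain the word as a whole token."""
    return (line for line in lines if word in line.split())


def count(lines, word):
    """Stage 3: count occurrences of the word in the surviving lines."""
    return sum(line.split().count(word) for line in lines)


text = "foo bar\nbaz\nfoo foo\n"
print(count(grep(cat(text), "foo"), "foo"))
```

Each stage consumes the previous stage's output lazily, which is the shape of a multistage pipeline.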
68. Apache Hive
● FB data warehouse paper
– Introduce tables into HDFS (schema)
– Requires DB to store metadata
– HiveQL
● Solved problems
– Easy for analysts to use, w/o writing MR jobs
– Stored metadata, unlike Pig
– Supports user queries w/joins
– Doesn't support UPDATE. Can't update a file in HDFS. Files are immutable.
69. Hive QL
● Create table foo(id int)
● Create table foo(id int) location '/tmp/data/data.txt'.
– Hive moves the data.txt file! Looks like Hive deleted it. Use an external table; when it is dropped, nothing happens to the data. Non-external table data is deleted after the table is dropped.
● We can parse csv files; this is different from a database b/c we are dealing with unstructured data.
● Create table foo1(username string, map<String,int>) row format delimited fields terminated by '; '
● Map is an aggregate type
70. Schema on read vs Schema on write
● A database is schema on write: data has to match the schema. Process the data, then import it into the db. Everything has to match: columns, format, etc.
● Hadoop is schema on read; can create any schema. Doesn't drop a column not defined like in DB.
● Typically, loading data into a database requires some clean-up program to get all the data into the right form, with the right number of columns and the right data ranges.
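Schema on read can be sketched by applying a schema only when the raw data is read. This is a minimal illustration in plain Python, not Hive; the sample data, column names and the choice to return None for fields that don't fit are assumptions made for the example.

```python
import csv
import io

# Raw data is stored as-is, including rows that are short, long or malformed.
RAW = "alice,34,NY\nbob,not_a_number\ncarol,29,CA,extra\n"


def read_with_schema(raw, schema):
    """Apply (name, type) columns at read time; fields that don't fit become None."""
    rows = []
    for fields in csv.reader(io.StringIO(raw)):
        row = {}
        for i, (name, cast) in enumerate(schema):
            try:
                row[name] = cast(fields[i])
            except (IndexError, ValueError):
                row[name] = None
        rows.append(row)
    return rows


# Two different schemas over the same untouched bytes.
print(read_with_schema(RAW, [("name", str), ("age", int)]))
print(read_with_schema(RAW, [("name", str), ("age", int), ("state", str)]))
```

A schema-on-write system would reject or require cleaning of the second and third rows before load; here nothing is rejected and the interpretation is chosen per query.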
71. Hive Serdes
● Use a SerDe to import data without the preprocessing a database requires
CREATE TABLE access_log (
remote_ip STRING, request_date STRING, method STRING, request STRING, protocol STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) . . \\[([^\\]]+)\\] \"([^ ]*) ([^ ]*) ([^ \"]*)\" .*",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s"
)
STORED AS TEXTFILE;
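The regex idea behind this table can be tried outside Hive. This is a minimal Python sketch: the pattern is an assumed reading of the slide's regex for a common access-log layout, written in Python syntax, and the sample log line is made up. Each capture group becomes one column, in the order the table declares them.

```python
import re

# ASSUMPTION: pattern reconstructed for a common access-log layout; one group per column.
LOG_PATTERN = re.compile(r'([^ ]*) . . \[([^\]]+)\] "([^ ]*) ([^ ]*) ([^ "]*)" .*')
COLUMNS = ("remote_ip", "request_date", "method", "request", "protocol")


def parse_line(line):
    """Return a column -> value dict, or None when the whole line does not match."""
    match = LOG_PATTERN.fullmatch(line)
    return dict(zip(COLUMNS, match.groups())) if match else None


sample = '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
print(parse_line(sample))
print(parse_line("not an access log line"))
```

The raw log file stays untouched; the pattern is applied each time a row is read.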
72. Hive Impl
● External Hive tables are directories in HDFS. You can delete the files and the tables will be empty.
● Or you can add data into the directories and have the tables grow
● Hive adds a schema to HDFS
● HiveServer2:
– Security, multiple clients
● Hive+Tez, Hive-0.12+
74. HiveQL
● Select * from table1;
● Select col1,col2 from table2;
● Writing data into Hive
– LOAD DATA INPATH '/user/dc/tmp' INTO TABLE table1;
– LOAD DATA INPATH '/user/dc/tmp' OVERWRITE INTO TABLE table1; (DELETEs first before writing)
80. Apache HBase
● Schema design critical point
– Schema design shows understanding of architecture & implementation to use case
– Rows and Column families. Why?
– Wibidata Apache KijiSchema
81. HBase Client Design
● Do things in parallel, then merge the results. Not JDBC
● Mistake:
private void doMultipleClients(final Class<? extends Test> cmd) throws IOException {
  final List<Thread> threads = new ArrayList<Thread>(this.N);
  final int perClientRows = R/N;
  for (int i = 0; i < this.N; i++) {
    Thread t = new Thread (Integer.toString(i)) {
83. Google Dremel
● End of M/R
● select count(*) from publicdata:samples.wikipedia where REGEXP_MATCH(title, '[0-9]*') AND wp_namespace = 0;
● 35B rows in 10s/~35GB. How? 2 tricks
87. Apache Spark
● UC Berkeley BDAS, in CDH5, support here, statement from ULM changing
– Tachyon/Mesos/Spark/Shark/MLBase
88. Apache Storm
● Storm in HDP 2.X
– Instead of multiple threads across multiple servers.
– Sample code
89. Hands On Labs
● Install Hadoop on Amazon EC2.
– Goal: learn config/logs/how things run in HDFS & M/R
● M/R programming
– Goal: understand internals of M/R. Understand implications of production, how to balance 1k M/R jobs in a cluster (programming Java M/R)
● HDFS hands on
● M/R hands on, no programming
● cs246, cs246h
● Individual Components
90. Hands on Labs
● Systems labs
– How to create a Data Repository?
● HDFS Schemas
– Zookeeper, coordination and distributed programming
– YARN/Mesos examples
– Spark/Storm
Editor's Notes
When you create and delete files, you are adding/removing inodes from the superblock. When you add contents to a file and save it, like adding text in Word, you are adding data blocks to an inode.
Write: write into the write-ahead log and the in-memory store, then persist to disk files. Small disk files have to be merged into larger files to reduce search time for reads
Rowkeys are sorted and split into regions. Sequential design vs. random key. Avoid hotspots limiting cluster throughput
HBase regions are autosplit when they get too big on writes