This document provides an overview of Apache Hadoop, including what it is, how it works using MapReduce, and when it may be a good solution. Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity servers. It allows for the parallel processing of large datasets in a reliable, fault-tolerant manner. The document discusses how Hadoop is used by many large companies, how it works based on the MapReduce paradigm, and recommends Hadoop for problems involving big data that can be modeled with MapReduce.
3. • What is Hadoop?
• Why do people use Hadoop?
• How does it work?
• When should you consider Hadoop?
4. What is Hadoop?
Apache Hadoop is an open-source, Java-based system for processing data on a network of commodity servers using a MapReduce paradigm.
5. How do people use Hadoop?
A few examples from the Apache site
– Amazon search
– Facebook log storage and reporting
– LinkedIn’s People You May Know
– Twitter data analysis
– Yahoo! uses it for ad targeting
A search on LinkedIn shows people in financial services, biotech, oil and gas exploration, retail, and other industries are using Hadoop.
6. Where did Hadoop come from?
• Hadoop was created by Doug Cutting. It’s named after his son’s toy elephant.
• Hadoop was written to support Nutch, an open source web search engine. Hadoop was spun out in 2006.
• Yahoo! invested in Hadoop, bringing it to “web scale” by 2008.
7. Hadoop is open source
• Hadoop is an open source project (Apache license)
– You can download and install it freely
– You can also compile your own custom version of Hadoop
• There are three subprojects
8. Hadoop is written for Java
• The good news: Hadoop runs on a JVM
– You can run Hadoop on your workstation (for testing), on a private cluster, or in a cloud
– You can write Hadoop jobs in Java, or in Scala, JRuby, Jython, Clojure, or any other JVM language
– You can use other Java libraries
• The bad news: Hadoop was originally written by and for Java programmers.
– You can do basic work without knowing Java. But you will quickly get stuck if you can’t write code.
10. Hadoop runs on commodity servers
• Doesn’t require very fast, very big, or very reliable servers
• Works better on good quality servers connected through a fast network
• Hadoop is fault tolerant—multiple copies of data, protection against failed jobs
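An added aside, not from the original deck: the “multiple copies of data” point is an ordinary configuration knob. HDFS keeps dfs.replication copies of every block, and the value can be read or overridden through the standard Configuration API. A minimal sketch follows; the dfs.replication property name is real, while the class name and the value of 3 are only illustrative.

import org.apache.hadoop.conf.Configuration;

// Sketch: the HDFS block replication factor is just a configuration property.
public class ReplicationSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 3); // ask for three copies of every block written with this config
    System.out.println("Replication factor: " + conf.getInt("dfs.replication", 3));
  }
}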
11. When should you consider Hadoop?
• Big problem
• Fits Map/Reduce model
• Don’t need to compute in real time
• Technical team
12. Picking the right tool for the job
[Chart: data size on a log scale from 1 up to 1,000,000,000,000, mapped against tools: calculator, spreadsheet, numerical software, parallel systems, and a question mark beyond that.]
13. Man / Reduce
• I need 7 volunteers:
– 4 mappers
– 3 reducers
• We’re going to show how map/reduce works by sorting and counting some notes.
14. What is Map/Reduce
• You compute things in two phases
– The map step
• Reads the input data
• Transforms the data
• Tags each datum with a key and sends each datum to the right reducer
– The reduce step
• Collects all the data for each key
• Does some work on the data by key
• Outputs the results
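An added aside, not from the original deck: before the volunteer demo and the Hadoop code later in the slides, the two phases can be sketched on a single machine in plain Java. The map step tags each record with a key, the framework groups everything that shares a key, and the reduce step works on each group. The sample lines below are made up to resemble the page-view example used later.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
  public static void main(String[] args) {
    List<String> lines = Arrays.asList(
        "2012-04-24,Home Page,Joe",
        "2012-04-24,Login,",
        "2012-04-25,Settings,Joe");

    // Map step: tag each record with a key (here, the day it belongs to).
    // Shuffle: group every record that shares a key.
    Map<String, List<String>> grouped = new TreeMap<>();
    for (String line : lines) {
      String key = line.split(",")[0];
      grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(line);
    }

    // Reduce step: do some work on each key's group (here, just count it).
    for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
      System.out.println(entry.getKey() + "\t" + entry.getValue().size());
    }
  }
}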
15. Map/Reduce is over 100 years old
• Hollerith machines from the 1890 census
16. Good fits for Map/Reduce
• Aggregating unstructured data to enter into a database (ETL)
• Creating email messages
• Processing log files and creating reports
17. Problems that don’t perfectly fit
• Logistic regression
• Matrix operations
• Social graph calculations
18. Batch computation
Hadoop is a shared system that allocates resources to jobs from a queue. It’s not a real-time system.
19. Coding example
Suppose that we had some log files with events by date (say, page views). Let’s count the number of events by day!
Sample data:
1335300359000,Home Page, Joe
1335300359027,Login,
1335300359031,Home Page, Romy
1335300369123,Settings, Joe
…
20. A Java Example
• Mappers will
– Read the input files
– Extract the timestamp
– Round to the nearest day
– Set the output key to the day
• Reducers will
– Iterate through records by day, counting records
– Output the count for each day
21. A Java example (Mapper)
public class exampleMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    String[] values = line.split(",");
    // The first field is an epoch-millisecond timestamp; format it as a day (Joda-Time).
    Long timeStampLong = Long.parseLong(values[0]);
    DateTime timeStamp = new DateTime(timeStampLong);
    DateTimeFormatter dateFormat = ISODateTimeFormat.date();
    // Key each record by its calendar day; keep the whole line as the value.
    output.collect(new Text(dateFormat.print(timeStamp)), new Text(line));
  }
}
22. A Java example (Reducer)
public class exampleReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, LongWritable> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    long count = 0;
    // Count the records for this key (one key = one day); advance the iterator each time.
    while (values.hasNext()) {
      values.next();
      count++;
    }
    output.collect(key, new LongWritable(count));
  }
}
23. A Java example (job file)
public class exampleJob extends Configured implements Tool {
  @Override
  public int run(String[] arg0) throws Exception {
    JobConf conf = new JobConf(getConf(), getClass());
    conf.setJobName("Count events by date");
    conf.setInputFormat(TextInputFormat.class);
    TextInputFormat.addInputPath(conf, new Path(arg0[0]));
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    // The mapper emits Text values while the job output value is LongWritable,
    // so the map output value class has to be declared explicitly.
    conf.setMapOutputValueClass(Text.class);
    TextOutputFormat.setOutputPath(conf, new Path(arg0[1]));
    conf.setMapperClass(exampleMapper.class);
    conf.setReducerClass(exampleReducer.class);
    JobClient.runJob(conf);
    return 0;
  }
}
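An added aside, not part of the original deck: the code above uses the older org.apache.hadoop.mapred API (MapReduceBase, OutputCollector, JobConf). A rough sketch of the same per-day count against the newer org.apache.hadoop.mapreduce API is shown below; the class names are made up, and it assumes, like the example above, that the first comma-separated field is an epoch-millisecond timestamp.

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (day, 1) for every log line.
public class DailyCountMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    long millis = Long.parseLong(value.toString().split(",")[0]);
    String day = new SimpleDateFormat("yyyy-MM-dd").format(new Date(millis));
    context.write(new Text(day), ONE);
  }
}

// Reducer: add up the ones for each day.
class DailyCountReducer
    extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long count = 0;
    for (LongWritable v : values) {
      count += v.get();
    }
    context.write(key, new LongWritable(count));
  }
}

Either version is packaged into a jar and launched with the hadoop jar command, for example hadoop jar events.jar exampleJob <input> <output>; the jar name and paths here are placeholders.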
24. • Tools that make it easier to use Hadoop:
– Hive
– Pig
– Cascading
25. Cascading
• Tool for constructing Hadoop workflows in Java
• Example:
Scheme pvScheme = new TextLine(new Fields("timestamp", …));
Tap source = new Hfs(pvScheme, inpath);
Scheme countScheme = new TextLine(new Fields("date", "count"));
Tap sink = new Hfs(countScheme, outpath);
Pipe assembly = new Pipe("pagesByDate");
// Format the epoch-millisecond timestamp into a "date" field.
Function function = new DateFormatter(new Fields("date"), "yyyy/MM/dd");
assembly = new Each(assembly, new Fields("timestamp"), function);
assembly = new GroupBy(assembly, new Fields("date"));
Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);
Properties properties = new Properties();
FlowConnector.setApplicationJarClass(properties, Main.class);
FlowConnector flowConnector = new FlowConnector(properties);
Flow flow = flowConnector.connect("pagesByDate", source, sink, assembly);
flow.complete();
26. Pig
• Tool to write high-level data-flow scripts (Pig Latin) that compile down to MapReduce jobs
• Example:
define TODATE
    org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay();
%declare now `date "+%s000"`;
page_views = LOAD 'PAGEVIEWS' USING PigStorage()
    AS (timestamp:long, page:chararray, user:chararray);
last_week = FILTER page_views BY timestamp > $now - 86400000 * 7;
truncated = FOREACH last_week GENERATE *, TODATE(timestamp) as date;
grouped = GROUP truncated BY date;
counted = FOREACH grouped GENERATE group as date, COUNT_STAR(truncated) as N;
sorted = ORDER counted BY date;
STORE sorted INTO 'results' USING PigStorage();
27. Hive
• Tool from Facebook that lets you write SQL queries against Hadoop
• Example code:
-- assumes timestamp is an epoch-millisecond value, as in the sample data
SELECT TO_DATE(FROM_UNIXTIME(timestamp DIV 1000)) AS dt, COUNT(*)
FROM PAGEVIEWS
WHERE timestamp > (UNIX_TIMESTAMP() - 86400 * 7) * 1000
GROUP BY TO_DATE(FROM_UNIXTIME(timestamp DIV 1000))
ORDER BY dt
28.
29. Some important related projects
• HBase
• NextGen Hadoop (0.23)
• ZooKeeper
• Mahout
• Giraph
30. What to do next
• Watch training videos at
http://www.cloudera.com/resource-types/video/
• Get Hadoop (including the code!) at
http://hadoop.apache.org
• Get commercial support from
http://www.cloudera.com/
or http://hortonworks.com/
• Run it in the cloud with Amazon Elastic Map Reduce:
http://aws.amazon.com/elasticmapreduce/
Editor's Notes
Thanks for having me here today as part of Big Data week. For a lot of people, Hadoop is big data. Today, I’m here to share my experience as a Hadoop user. I use Hadoop every day at LinkedIn because it helps me get my work done. Ask audience: Who uses Hadoop now? Who is thinking about it? Who sort of knows what Hadoop is for, but isn’t sure how it helps them?
Hadoop can help you if you have a gigantic amount of data. You can do things with Hadoop that are hard to do with any other off-the-shelf tool. But Hadoop can be a handful.
I’m hoping that you leave here today knowing what Hadoop is.
Open source. Java based. Network of servers. Commodity servers. Map reduce.
The biggest users are mostly web companies: Amazon builds their search indices on Hadoop. Facebook processes all their usage logs on Hadoop. (They also store photos with HBase.) I bet they do other things as well. Twitter uses Hadoop for data analysis. Yahoo! uses Hadoop for many things, including a lot of their advertising models. eBay and Netflix use Hadoop as well. And a lot more people are using Hadoop for some tasks.
The source code for Hadoop is freely available, and easy to modify. But that doesn’t mean it’s cheap and easy to run. It takes a lot of operational expertise to set up and run a system with hundreds or thousands of computers. Every big Hadoop shop has a team of developers and operations people who keep the system running. We’ve modified the Hadoop scheduler, added extra code for debugging, and fixed quite a few bugs.
I have become very good at reading Java stack traces.
Hadoop was designed to run on commodity servers. It doesn’t need servers with super-fast processors, huge amounts of memory, solid state disks, or any other exotic features. But that doesn’t mean you should just run down to Fry’s and buy the cheapest computers you can find. Cheap computers fail more often. You need to find a good balance between cost and reliability. By the way, Hadoop runs really well on cloud services.
Even really good quality computers fail, and Hadoop was designed to deal with that problem. If the probability of a machine failing is 1/1000 for a given day, you’re going to see failures when you have thousands of computers. As a user, you don’t usually have to worry too much about how Hadoop runs your jobs. But sometimes, understanding what Hadoop is doing can help you understand what the system is up to.
Let’s talk about each of these things. Hadoop is great for doing all the data munging that you do at the start of a data project.
Mentally, this is my hierarchy of tools. As your data gets bigger, it takes more work to use each tool, so I try not to overshoot. [Should add in databases, python tools in the middle of R and hadoop.] But sometimes, you have to upgrade. For example, suppose that it takes 25 hours to analyze 24 hours of data on your desktop…
As we said before, for your problem to fit, your problem should meet 4 criteria… one of them is that it has to work with Map/Reduce. To help explain map reduce, we’re going to use map reduce here to do some work. [ask for volunteers]
The key is used to group data together and to route it to the right reducer.
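(An added note, not from the original notes: with Hadoop’s default HashPartitioner the routing is simply partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, which is why every record sharing a key ends up on the same reducer.)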
At LinkedIn, we have hundreds of users on our Hadoop system running dozens of jobs. It’s pretty busy in the middle of the day. Unlike some other tools (like Oracle), Hadoop won’t start working on your problem until earlier jobs finish. It’s a very efficient way to use resources, but it could mean that you have to wait around for a long time.
So far, we’ve talked about who uses Hadoop, and how Hadoop works. I’d like to show an example of what you see as a Hadoop user: how do you write programs for Hadoop? In practice, you might have many input files from many different web servers. Or maybe one giant file. Either way, Hadoop can split up those files to divide the processing work across the cluster.
Most Java map/reduce jobs have three parts: a mapper, a reducer, and a job file. I’m going to walk through all three of them here.
Here is part of the Java Map/Reduce job for doing this calculation. At this point, it should be clear why we didn’t make this a hands-on session. I’m not going to explain everything that’s going on here, but I’ll point out a few pieces of how this works.
All the records for a key are handled by a specific reducer. In this case, that means that all the records for each date will be sent to a single reducer, so all we have to do is to count those records.
Lastly, you connect everything together with a job file and run it.
I’ve probably scared off a lot of people in this room by showing the Java Map/Reduce code. Luckily, there are some simpler ways to solve the problem.
One of the coolest things about Cascading is that you can use it from other JVM languages: Jython, JRuby, Clojure, and Scala.
Don’t need a lot of software, can run from your workstation
Hive is great, but it takes some work to set it up. It’s great for working with unstructured data… The big disadvantage of Hive is that every operation is a full table scan. With a database like Oracle, data is stored with indexes, so you can quickly look up single values. Hive is good for large calculations, bad for lookups. Another issue with Hive is that it’s not as mature as most databases. You can easily see a Java stack trace.