This document provides an overview of big data analysis using Hadoop and Pig. It introduces Dušan Zamurović and his work with big data reporting and quality assurance. It then discusses MapReduce and Hadoop for processing large datasets, and the Pig query language for analyzing data stored in Hadoop. Examples are provided of social network interaction analysis using Pig, including loading and parsing JSON event logs, grouping and summarizing interactions, and calculating average interaction weights.
2.
Name: Dušan Zamurović
Where do I come from?
◦ codecentric Novi Sad
What do I do?
◦ Java web-app background
◦ ♥ JavaScript ♥
Ajax with DWR lib
◦ Android
◦ currently Big Data (reporting QA)
5. A revolution that will transform how we live, work,
and think.
3 Vs of big data
◦ Volume
◦ Variety
◦ Velocity
Everyday use cases
◦ Beautiful
◦ Useful
◦ Funny
6.
The principal characteristic
Studies report
◦ 1.2 trillion gigabytes of new data was created
worldwide in 2011 alone
◦ From 2005 to 2020, the digital universe will grow
by a factor of 300
◦ By 2020 the digital universe will amount to 40
trillion gigabytes (more than 5,200 gigabytes for
every man, woman, and child in 2020)
7.
The biggest growth – unstructured data
◦ Documents
◦ Web logs
◦ Sensor data
◦ Videos and photos
◦ Medical devices
◦ Social media
>90% of this Big Data is unstructured
Analytic value?
◦ 33% valuable info by 2020
8.
Generated at high speed
Needs real-time processing
Example I
◦ Financial world
◦ Thousands or millions of transactions
Example II
◦ Retail
◦ Analyze click streams to offer recommendations
9. Value of Big Data is potentially great but can be
released only with the right combination of
people, processes and technologies.
…unlock significant value by making
information transparent and usable at much
higher frequency
10.
Measuring heartbeat of a city - Rio de Janeiro
More examples
◦ Product development – most valuable features
◦ Manufacturing – indicators of quality problems
◦ Distribution – optimize inventory and supply chains
◦ Sales – account targeting, resource allocation
Beer and diapers
Possible issues?
◦ Privacy, security, intellectual property, liability…
11. "Map/Reduce is a programming model and an
associated implementation for processing and
generating large data sets. Users specify a map
function that processes a key/value pair to
generate a set of intermediate key/value pairs,
and a reduce function that merges all
intermediate values associated with the same
intermediate key."
- research publication
http://research.google.com/archive/mapreduce.html
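To make the model concrete, here is a minimal word-count sketch in Java against Hadoop's org.apache.hadoop.mapreduce API. The task and the class names are illustrative, not taken from the talk: the map function emits a (word, 1) pair for every token, and the reduce function sums the values that arrive under the same key.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // map: (offset, line) -> (word, 1) for every token in the line
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // reduce: (word, [1, 1, ...]) -> (word, total)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}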
12.
13.
In the beginning, there was Nutch
Which problems does it address?
◦ Big Data
◦ Not fit for RDBMS
◦ Computationally extensive
Hadoop && RDBMS
◦ “Get data to process” or “send code where data is”
◦ Designed to run on a large number of machines
◦ Separate storage
14.
Distributed File System
◦ Designed for commodity hardware
◦ Highly fault-tolerant
◦ Relaxed POSIX
To enable streaming access to file system data
Assumptions and Goals
◦ Hardware failure
◦ Streaming data access
◦ Large data sets
◦ Write-once-read-many
◦ Move computation, not data
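The streaming, write-once-read-many access model above can be illustrated with the Java FileSystem API; this is a minimal sketch, and the path is only an example:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);            // the default (distributed) file system
    Path path = new Path("/logs/events.log");        // example path, not from the talk
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {   // stream the file line by line
        System.out.println(line);
      }
    }
  }
}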
15.
NameNode
◦ Master server, central component
◦ HDFS cluster has single NameNode
◦ Manages client’s access
◦ Keeps track where data is kept
◦ Single point of failure
Secondary NameNode
◦ Optional component
◦ Checkpoints of the namespace
Does not provide any real redundancy
16.
DataNode
◦ Stores data in the file system
◦ Talks to NameNode and responds to requests
◦ Talks to other DataNodes
Data replication
TaskTracker
◦ Should be where DataNode is
◦ Accepts tasks (Map, Reduce, Shuffle…)
◦ Set of slots for tasks
◦ ♥__ ♥__ ♥__ ________ ♥_ ♥ ♥ ♥__________________
17.
JobTracker
◦ Farms tasks to specific nodes in the cluster
◦ Point of failure for MapReduce
How it goes?
1. Client submits a job to the JobTracker
2. JobTracker asks the NameNode where the data is
3. JobTracker locates TaskTracker nodes
4. JobTracker submits the tasks to the chosen TaskTrackers
5. TaskTracker ♥__ ♥__ ♥__
◦ Job failed: TaskTracker informs the JobTracker, which decides how to proceed
◦ Job done: JobTracker updates the status
6. Client can poll JobTracker for information
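As a sketch of the client side of this flow, and assuming the WordCount classes from the earlier example, a job is configured and submitted roughly like this (input and output paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");          // the job the client submits
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input/books
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output/wordcount
    // submit the job and poll for status until it finishes
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}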
18.
Platform for analyzing large data sets
◦ Language – Pig Latin
◦ High level approach
◦ Compiler
◦ Grunt shell
Pig compared to SQL
◦ Lazy evaluation
◦ Procedural language
◦ More like an execution plan
19.
Pig Latin statements
◦ A relation is a bag
◦ A bag is a collection of tuples
◦ A tuple is an ordered set of fields
◦ A field is a piece of data
◦ A relation is referenced by name, i.e. alias
A = LOAD 'student' USING PigStorage() AS
(name:chararray, age:int, gpa:float);
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
20.
Data types
◦ Simple
int – signed 32-bit integer
long – signed 64-bit integer
float – 32-bit floating point
double – 64-bit floating point
chararray – UTF-8 string
bytearray – blob
boolean – since Pig 0.10
datetime
◦ Complex
tuple – an ordered set of fields, e.g. (21,32)
bag – a collection of tuples, e.g. {(21,32),(32,43)}
map – a set of key-value pairs, e.g. [pig#latin]
21.
Data structure and defining schemas
◦ Why define a schema?
◦ Where to define a schema?
◦ How to define a schema?
/* data types not specified */
a = LOAD '1.txt' AS (a0, b0);
a: {a0: bytearray,b0: bytearray}
/* number of fields not known */
a = LOAD '1.txt';
a: Schema for a unknown
24.
User Defined Functions
◦ Java, Python, JavaScript, Ruby, Groovy
How to write a UDF?
◦ Eval function extends EvalFunc<something>
◦ Load function extends LoadFunc
◦ Store function extends StoreFunc
How to use a UDF?
◦ Register
◦ Define the name of the UDF if you like
◦ Call it
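A minimal eval UDF in Java might look like the sketch below; the class name and behaviour are illustrative (the canonical upper-case example), not code from the talk:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Turns its first argument into upper case; returns null for empty input.
public class Upper extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return ((String) input.get(0)).toUpperCase();
  }
}

In the Grunt shell it would then be registered with REGISTER myudfs.jar; (the jar name is an example) and called as com.example.Upper(name) inside a FOREACH … GENERATE statement, optionally behind a shorter alias created with DEFINE.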
25.
26.
Imaginary social network
A lot of users…
… with their friends, girlfriends, boyfriends, wives,
husbands, mistresses, etc…
New relationship arises…
◦ … but new friend is not shown in news feed
Where are his/her activities?
◦ Hidden, marked as not important
27.
Find out the value of the relationship
Monitor and log user activities
◦ For each user, of course
◦ Each activity has some value (event weight)
◦ Records user’s activities
◦ Store those logs in HDFS
◦ Analyze those logs from time to time
◦ Calculate needed values
◦ Show only the activities of “important” friends
28.
Events recorded in JSON format
{
"timestamp": 1341161607860,
"sourceUser": "marry.lee",
"targetUser": "ruby.blue",
"eventName": "VIEW_PHOTO",
"eventWeight": 1
}
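Purely for illustration (this is not necessarily how the JsonLoader UDF used in the next slides works), such an event can be parsed in Java with Jackson:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class EventParser {
  public static void main(String[] args) throws Exception {
    String json = "{\"timestamp\": 1341161607860, \"sourceUser\": \"marry.lee\","
        + " \"targetUser\": \"ruby.blue\", \"eventName\": \"VIEW_PHOTO\", \"eventWeight\": 1}";

    JsonNode event = new ObjectMapper().readTree(json);   // parse one event log line
    String source = event.get("sourceUser").asText();
    String target = event.get("targetUser").asText();
    int weight    = event.get("eventWeight").asInt();

    System.out.println(source + " -> " + target + " (weight " + weight + ")");
  }
}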
38. REGISTER codingserbia-udf.jar
DEFINE AVG_WEIGHT com.codingserbia.udf.AverageWeight();
interactionRecords = LOAD '/blog/user_interaction_big.json'
USING com.codingserbia.udf.JsonLoader();
interactionData = FOREACH interactionRecords GENERATE
sourceUser,
targetUser,
eventWeight;
groupInteraction = GROUP interactionData BY (sourceUser,
targetUser);
…
39. …
summarizedInteraction = FOREACH groupInteraction GENERATE
group.sourceUser AS sourceUser,
group.targetUser AS targetUser,
SUM(interactionData.eventWeight) AS eventWeight,
COUNT(interactionData.eventWeight) AS eventCount,
AVG_WEIGHT(
SUM(interactionData.eventWeight),
COUNT(interactionData.eventWeight)) AS averageWeight;
result = ORDER summarizedInteraction BY
sourceUser, eventWeight DESC;
STORE result INTO '/results/pig_mr' USING PigStorage();
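The AVG_WEIGHT function registered at the top of the script is the custom AverageWeight UDF from codingserbia-udf.jar. Its source is not shown in the slides; a plausible sketch, assuming it simply divides the summed weight by the event count, could look like this:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical reconstruction: averageWeight = SUM(eventWeight) / COUNT(eventWeight)
public class AverageWeight extends EvalFunc<Double> {
  @Override
  public Double exec(Tuple input) throws IOException {
    if (input == null || input.size() < 2 || input.get(0) == null || input.get(1) == null) {
      return null;
    }
    double sum = ((Number) input.get(0)).doubleValue();   // SUM(interactionData.eventWeight)
    long count = ((Number) input.get(1)).longValue();     // COUNT(interactionData.eventWeight)
    return count == 0 ? null : sum / count;
  }
}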
Editor's Notes
Big data. One of the buzz words of the software industry in the last decade. We all heard about it, but I am not sure we can actually comprehend it as we should and as it deserves. It reminds me of the Universe – mankind knows that it is big, huge, vast, but no one can really understand its size. The same can be said for the amount of data being collected and processed every day somewhere in the clouds of IT. As Google’s CEO, Eric Schmidt, once said: “There were 5 exabytes of information created by the entire world between the dawn of civilization and 2003. Now that same amount is created every two days.”
Almost every organization has to deal with huge amounts of data. Much of this exists in conventional structured forms, stored in relational databases. However, the biggest growth comes from unstructured data, both from inside and outside the enterprise - including documents, web logs, sensor data, videos, medical devices and social media. According to some studies, more than 90% of Big Data is unstructured data. The majority of information in the digital universe, 68% in 2012, is created and consumed by consumers watching digital TV, interacting with social media, sending camera phone images and videos between devices and around the Internet, and so on. But only a fraction of it is explored for analytic value. Some studies say that only 33% of the digital universe will contain valuable info by 2020.
As well as volume and variety, Big Data is often said to exhibit "velocity" - meaning that the data is being generated at high speed, and needs real-time processing and analysis. One example of the need for real-time processing of Big Data is in the financial world, where thousands or millions of transactions must be continuously analyzed for possible fraud in a matter of seconds. Another example is in retail, where a business may be analyzing many customer click-streams and purchases to generate real-time intelligent recommendations.
As organizations create and store more data in digital form, they can collect more accurate and detailed performance information on everything from product inventories to sick days, and therefore expose variability and boost performance. Companies are using data collection and analysis to conduct controlled experiments to make better management decisions.
Measuring the heartbeat of a city - Rio de Janeiro: 6.5M people, 8M vehicles, 4M bus passengers, 44k police... tropical monsoon climate. Big Data is used to monitor weather, traffic (GPS-tracked buses and medical vehicles), police, emergency services - using analytics to predict problems before they occur.
Not such a beautiful example, but Big Data also influences business and decision making:
- Product Development: incorporate the features that matter most
- Manufacturing: flag potential indicators of quality problems
- Distribution: quantify optimal inventory and supply chain activities
- Marketing: identify your most effective campaigns for engagement and sales
- Sales: optimize account targeting, resource allocation, revenue forecasting
Several issues will have to be addressed to capture the full potential of big data. Policies related to privacy, security, intellectual property, and even liability will need to be addressed in a big data world. Organizations need not only to put the right talent and technology in place but also structure workflows and incentives to optimize the use of big data. Access to data is critical—companies will increasingly need to integrate information from multiple data sources, often from third parties, and the incentives have to be in place to enable this.
The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn’t fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That’s exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms. Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they’re more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built. Those are just a few examples.
Hadoop is designed to run on a large number of machines that don’t share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization’s data into Hadoop, what the software does is bust that data into pieces that it then spreads across your different servers. There’s no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because there are multiple copies stored, data on a server that goes offline or dies can be automatically replicated from a known good copy.
In a centralized database system, you’ve got one big disk connected to four or eight or 16 big processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That’s MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject.

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
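The write-once-read-many model shows up directly in the client API: a file is created and written through a single output stream, closed, and from then on any number of clients can stream it back sequentially. The following sketch uses the org.apache.hadoop.fs.FileSystem API with a hypothetical path and record contents; it assumes the usual core-site.xml/hdfs-site.xml configuration is available on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/events/log-00001.txt");  // hypothetical path

        // Write once: create the file, stream records into it, close it.
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("event-1\n");
            out.writeBytes("event-2\n");
        }

        // Read many: clients stream the closed file sequentially, no random updates.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}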
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

The NameNode is a Single Point of Failure for the HDFS cluster. HDFS is not currently a High Availability system: when the NameNode goes down, the file system goes offline. There is an optional SecondaryNameNode that can be hosted on a separate machine. It only creates checkpoints of the namespace by merging the edits file into the fsimage file and does not provide any real redundancy.
A DataNode stores data in the Hadoop file system. A functional filesystem has more than one DataNode, with data replicated across them. On startup, a DataNode connects to the NameNode, spinning until that service comes up, and then responds to requests from the NameNode for filesystem operations. Client applications can talk directly to a DataNode once the NameNode has provided the location of the data. DataNode instances can also talk to each other, which is what they do when they are replicating data. TaskTracker instances can, and indeed should, be deployed on the same servers that host DataNode instances, so that MapReduce operations are performed close to the data.

A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker. Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.

The TaskTracker spawns separate JVM processes to do the actual work; this ensures that a process failure does not take down the TaskTracker itself. The TaskTracker monitors these spawned processes, capturing the output and exit codes. When a process finishes, successfully or not, the tracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least nodes in the same rack. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. A typical job flow looks like this:
- Client applications submit jobs to the JobTracker.
- The JobTracker talks to the NameNode to determine the location of the data.
- The JobTracker locates TaskTracker nodes with available slots at or near the data.
- The JobTracker submits the work to the chosen TaskTracker nodes.
- The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
- A TaskTracker notifies the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
- When the work is completed, the JobTracker updates its status.
- Client applications can poll the JobTracker for information.
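From the client's point of view, this submit-and-poll cycle can be sketched with the org.apache.hadoop.mapreduce.Job API. The sketch below reuses the hypothetical WordCount classes from the earlier example and takes input/output paths from the command line; it illustrates the polling pattern only, not a prescribed client.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitAndPoll {
    public static void main(String[] args) throws Exception {
        // Configure the job exactly as in the word-count sketch above (hypothetical classes)
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.submit();                       // hand the job over to the cluster and return immediately
        while (!job.isComplete()) {         // poll for progress
            System.out.printf("map %.0f%% reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");
    }
}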
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Compared with SQL, Pig offers:
- Lazy evaluation
- The ability to store data at any point in the pipeline
- A procedural language, more like an execution plan, which offers more control over the flow of data processing
- A choice of join implementation: SQL offers an option to join two tables, but Pig also lets you choose how that join is executed (see the sketch after this list)
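As a small illustration of those last two points, the following sketch drives Pig from Java through the PigServer API, picks a specific join implementation (a fragment-replicate join via USING 'replicated'), and stores an intermediate result. The input files and aliases are hypothetical.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class JoinChoiceExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);   // local mode, just for the example

        // 'events.tsv' and 'users.tsv' are hypothetical inputs
        pig.registerQuery("events = LOAD 'events.tsv' AS (user_id:int, action:chararray);");
        pig.registerQuery("users  = LOAD 'users.tsv'  AS (user_id:int, country:chararray);");

        // Pig lets you choose the join implementation: here the small 'users' relation
        // (listed last) is replicated to every map task instead of going through a reduce
        pig.registerQuery("joined = JOIN events BY user_id, users BY user_id USING 'replicated';");

        // Results can be stored at any point in the pipeline
        pig.store("joined", "joined_out");
    }
}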
A Pig relation is a bag of tuples. A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table. Unlike a relational table, however, Pig relations don't require that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.

Also note that relations are unordered, which means there is no guarantee that tuples are processed in any particular order. Furthermore, processing may be parallelized, in which case tuples are not processed according to any total ordering.
Schemas enable you to assign names to fields and declare types for fields. Schemas are optional, but we encourage you to use them whenever possible; type declarations result in better parse-time error checking and more efficient code execution. Schemas are defined with the LOAD, STREAM, and FOREACH operators using the AS clause.
- You can define a schema that includes both the field name and field type.
- You can define a schema that includes the field name only; in this case, the field type defaults to bytearray.
- You can choose not to define a schema; in this case, the field is un-named and the field type defaults to bytearray.
If you assign a name to a field, you can refer to that field using the name or by positional notation. If you don't assign a name to a field (the field is un-named) you can only refer to the field using positional notation.
If you assign a type to a field, you can subsequently change the type using the cast operators. If you don't assign a type to a field, the field defaults to bytearray; you can change the default type using the cast operators.
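A small, illustrative sketch (again driving Pig through PigServer, with hypothetical input files) showing the three schema options and the name, positional, and cast notations described above:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class SchemaExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Full schema: field names and types via the AS clause ('people.tsv' is hypothetical)
        pig.registerQuery("a = LOAD 'people.tsv' AS (name:chararray, age:int);");

        // Names only: every field type defaults to bytearray
        pig.registerQuery("b = LOAD 'people.tsv' AS (name, age);");

        // No schema: fields are un-named and can only be referenced positionally
        pig.registerQuery("c = LOAD 'people.tsv';");

        // Named fields can be referenced by name or by position ($0, $1, ...)
        pig.registerQuery("d = FOREACH a GENERATE name, $1;");

        // Un-typed fields default to bytearray and can be cast to concrete types later
        pig.registerQuery("e = FOREACH c GENERATE (chararray) $0, (int) $1;");

        pig.store("e", "schema_out");
    }
}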
Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in five languages: Java, Python, JavaScript, Ruby and Groovy.

The most extensive support is provided for Java functions. You can customize all parts of the processing including data load/store, column transformation, and aggregation. Java functions are also more efficient because they are implemented in the same language as Pig and because additional interfaces are supported.

Limited support is provided for Python, JavaScript, Ruby and Groovy functions. These functions are new, still evolving, additions to the system. Currently only the basic interface is supported; load/store functions are not supported. Furthermore, JavaScript, Ruby and Groovy are provided as experimental features because they did not go through the same amount of testing as Java or Python. Note that at runtime Pig will automatically detect the usage of a scripting UDF in the Pig script and will automatically ship the corresponding scripting jar (Jython, Rhino, JRuby or Groovy-all) to the backend.

Pig also provides support for Piggy Bank, a repository for Java UDFs. Through Piggy Bank you can access Java UDFs written by other users and contribute Java UDFs that you have written.

Eval is the most common type of function. It can be used in FOREACH statements for whatever purpose. Its core method has the signature:

public String exec(Tuple input)
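As a concrete illustration, here is a minimal sketch of an eval function: a hypothetical UpperCase UDF that upper-cases a chararray field. The class and file names are illustrative, not part of Pig itself.

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: upper-cases the first field of the input tuple
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;                 // usual Pig convention: return null for empty input
        }
        return ((String) input.get(0)).toUpperCase();
    }
}

Once compiled into a jar, such a function would typically be registered and used from a Pig script along the lines of REGISTER myudfs.jar; followed by B = FOREACH A GENERATE UpperCase(name); (the jar and alias names are again illustrative).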
The load/store UDFs control how data goes into Pig and comes out of Pig. Often, the same function handles both input and output, but that does not have to be the case. The Pig load/store API is aligned with Hadoop's InputFormat and OutputFormat classes.

The LoadFunc abstract class is the main class to extend for implementing a loader. The methods which need to be overridden are explained below:
- getInputFormat(): This method is called by Pig to get the InputFormat used by the loader. The methods in the InputFormat (and underlying RecordReader) are called by Pig in the same manner (and in the same context) as by Hadoop in a MapReduce Java program. If the InputFormat is a Hadoop-packaged one, the implementation should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom InputFormat, it should be implemented using the new API in org.apache.hadoop.mapreduce. If a custom loader using a text-based or file-based InputFormat would like to read files in all subdirectories under a given input directory recursively, it should use the PigTextInputFormat and PigFileInputFormat classes provided in org.apache.pig.backend.hadoop.executionengine.mapReduceLayer. The Pig InputFormat classes work around a current limitation in the Hadoop TextInputFormat and FileInputFormat classes, which only read one level down from the provided input directory. For example, if the input in the load statement is 'dir1' and there are subdirs 'dir2' and 'dir2/dir3' beneath dir1, the Hadoop TextInputFormat and FileInputFormat classes read the files under 'dir1' only. Using PigTextInputFormat or PigFileInputFormat (or by extending them), the files in all the directories can be read.
- setLocation(): This method is called by Pig to communicate the load location to the loader. The loader should use this method to communicate the same information to the underlying InputFormat. This method is called multiple times by Pig; implementations should bear this in mind and ensure there are no inconsistent side effects due to the multiple calls.
- prepareToRead(): Through this method the RecordReader associated with the InputFormat provided by the LoadFunc is passed to the LoadFunc. The RecordReader can then be used by the implementation in getNext() to return a tuple representing a record of data back to Pig.
- getNext(): The meaning of getNext() has not changed; it is called by the Pig runtime to get the next tuple in the data. In this method the implementation should use the underlying RecordReader and construct the tuple to return.

The StoreFunc abstract class has the main methods for storing data, and for most use cases it should suffice to extend it. The methods which need to be overridden in StoreFunc are explained below (a sketch follows this list):
- getOutputFormat(): This method will be called by Pig to get the OutputFormat used by the storer. The methods in the OutputFormat (and underlying RecordWriter and OutputCommitter) will be called by Pig in the same manner (and in the same context) as by Hadoop in a MapReduce Java program. If the OutputFormat is a Hadoop-packaged one, the implementation should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom OutputFormat, it should be implemented using the new API under org.apache.hadoop.mapreduce. The checkOutputSpecs() method of the OutputFormat will be called by Pig to check the output location up-front. This method will also be called as part of the Hadoop call sequence when the job is launched, so implementations should ensure that it can be called multiple times without inconsistent side effects.
- setStoreLocation(): This method is called by Pig to communicate the store location to the storer. The storer should use this method to communicate the same information to the underlying OutputFormat. This method is called multiple times by Pig; implementations should ensure there are no inconsistent side effects due to the multiple calls.
- prepareToWrite(): In the new API, writing of the data is done through the OutputFormat provided by the StoreFunc. In prepareToWrite() the RecordWriter associated with the OutputFormat provided by the StoreFunc is passed to the StoreFunc. The RecordWriter can then be used by the implementation in putNext() to write a tuple representing a record of data in a manner expected by the RecordWriter.
- putNext(): The meaning of putNext() has not changed; it is called by the Pig runtime to write the next tuple of data. In the new API, this is the method wherein the implementation will use the underlying RecordWriter to write the Tuple out.
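To tie the StoreFunc methods together, here is a minimal, illustrative storer that delegates to Hadoop's TextOutputFormat and writes each tuple as a tab-separated line. The class name is hypothetical, and a loader (LoadFunc) would mirror the same pattern on the read side with getInputFormat(), setLocation(), prepareToRead() and getNext().

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

// Hypothetical storer: one tab-delimited text line per tuple
public class SimpleTextStorer extends StoreFunc {
    private RecordWriter writer;   // provided via prepareToWrite()

    @Override
    public OutputFormat getOutputFormat() throws IOException {
        // Delegate the actual writing to a standard Hadoop OutputFormat
        return new TextOutputFormat<NullWritable, Text>();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        // Pass the store location from the Pig script through to the OutputFormat;
        // this may be called multiple times, so it must be free of side effects
        FileOutputFormat.setOutputPath(job, new Path(location));
    }

    @Override
    public void prepareToWrite(RecordWriter writer) throws IOException {
        this.writer = writer;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void putNext(Tuple t) throws IOException {
        try {
            // Write each tuple as a tab-separated line of text
            writer.write(NullWritable.get(), new Text(t.toDelimitedString("\t")));
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}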