Coding Serbia
Dušan Zamurović, @codecentricRS

Slide notes
  • Big data. One of the buzz words of the software industry in the last decade. We have all heard about it, but I am not sure we can comprehend it as we should and as it deserves. It reminds me of the Universe: mankind knows that it is big, huge, vast, but no one can really grasp its size. The same can be said for the amount of data being collected and processed every day somewhere in the clouds of IT. As Google’s CEO Eric Schmidt once said: “There were 5 exabytes of information created by the entire world between the dawn of civilization and 2003. Now that same amount is created every two days.”
  • Almost every organization has to deal with huge amounts of data. Much of this exists in conventional structured forms, stored in relational databases. However, the biggest growth comes from unstructured data, both from inside and outside the enterprise, including documents, web logs, sensor data, videos, medical devices and social media. According to some studies, more than 90% of Big Data is unstructured data. The majority of information in the digital universe, 68% in 2012, is created and consumed by consumers watching digital TV, interacting with social media, sending camera phone images and videos between devices and around the Internet, and so on. But only a fraction of it is explored for analytic value: some studies say that only 33% of the digital universe will contain valuable information by 2020.
  • As well as volume and variety, Big Data is often said to exhibit "velocity", meaning that the data is being generated at high speed and needs real-time processing and analysis. One example of the need for real-time processing of Big Data is in the financial world, where thousands or millions of transactions must be continuously analyzed for possible fraud in a matter of seconds. Another example is in retail, where a business may be analyzing many customer click-streams and purchases to generate real-time intelligent recommendations.
  • As organizations create and store more data in digital form, they can collect more accurate and detailed performance information on everything from product inventories to sick days, and therefore expose variability and boost performance. Companies are also using data collection and analysis to conduct controlled experiments and make better management decisions.
  • Measuring the heartbeat of a city, Rio de Janeiro: 6.5M people, 8M vehicles, 4M bus passengers, 44k police..., tropical monsoon climate. Big Data is used to monitor weather, traffic (GPS-tracked buses and medical vehicles), police and emergency services, using analytics to predict problems before they occur. A less beautiful example: Big Data also influences business and decision making. Product development: incorporate the features that matter most. Manufacturing: flag potential indicators of quality problems. Distribution: quantify optimal inventory and supply chain activities. Marketing: identify your most effective campaigns for engagement and sales. Sales: optimize account targeting, resource allocation, revenue forecasting. Several issues will have to be addressed to capture the full potential of big data. Policies related to privacy, security, intellectual property, and even liability will need to be addressed in a big data world. Organizations need not only to put the right talent and technology in place but also to structure workflows and incentives to optimize the use of big data. Access to data is critical: companies will increasingly need to integrate information from multiple data sources, often from third parties, and the incentives have to be in place to enable this.
  • The Hadoop platform was designed to solve problems where you have a lot of data, perhaps a mixture of complex and structured data, and it doesn’t fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That’s exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms. Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they’re more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built. Those are just a few examples. Hadoop is designed to run on a large number of machines that don’t share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization’s data into Hadoop, what the software does is bust that data into pieces that it then spreads across your different servers. There’s no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because there are multiple copies stored, data stored on a server that goes offline or dies can be automatically replicated from a known good copy. In a centralized database system, you’ve got one big disk connected to four or eight or 16 big processors, but that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That’s MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set. (A minimal word-count sketch of this map/reduce pattern follows these notes.)
  • The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject. Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas have been traded to increase data throughput rates. Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance. HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending writes to files in the future. A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located. (A small sketch of writing and reading an HDFS file through the FileSystem API follows these notes.)
  • HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. The NameNode is a single point of failure for the HDFS cluster. HDFS is not currently a high availability system. When the NameNode goes down, the file system goes offline. There is an optional SecondaryNameNode that can be hosted on a separate machine. It only creates checkpoints of the namespace by merging the edits file into the fsimage file and does not provide any real redundancy.
  • A DataNode stores data in the Hadoop file system. A functional filesystem has more than one DataNode, with data replicated across them. On startup, a DataNode connects to the NameNode, spinning until that service comes up. It then responds to requests from the NameNode for filesystem operations. Client applications can talk directly to a DataNode once the NameNode has provided the location of the data. DataNode instances can talk to each other, which is what they do when they are replicating data. TaskTracker instances can, and indeed should, be deployed on the same servers that host DataNode instances, so that MapReduce operations are performed close to the data. A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack. The TaskTracker spawns separate JVM processes to do the actual work; this is to ensure that process failure does not take down the task tracker. The TaskTracker monitors these spawned processes, capturing the output and exit codes. When a process finishes, successfully or not, the tracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
  • The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least nodes in the same rack. The JobTracker is a point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. Client applications submit jobs to the JobTracker. The JobTracker talks to the NameNode to determine the location of the data, locates TaskTracker nodes with available slots at or near the data, and submits the work to the chosen TaskTracker nodes. The TaskTracker nodes are monitored; if they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable. When the work is completed, the JobTracker updates its status. Client applications can poll the JobTracker for information.
  • Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Compared to SQL, Pig offers lazy evaluation, the ability to store data at any point, and a procedural language that reads more like an execution plan, which offers more control over the flow of data processing. SQL offers an option to join two tables; Pig also offers a choice of join implementation.
  • A Pig relation is a bag of tuples. A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table. Unlike a relational table, however, Pig relations don't require that every tuple contain the same number of fields or that the fields in the same position (column) have the same type. Also note that relations are unordered, which means there is no guarantee that tuples are processed in any particular order. Furthermore, processing may be parallelized, in which case tuples are not processed according to any total ordering.
  • Schemas enable you to assign names to fields and declare types for fields. Schemas are optional, but we encourage you to use them whenever possible; type declarations result in better parse-time error checking and more efficient code execution. Schemas are defined with the LOAD, STREAM, and FOREACH operators using the AS clause. You can define a schema that includes both the field name and field type. You can define a schema that includes the field name only; in this case, the field type defaults to bytearray. You can choose not to define a schema; in this case, the field is un-named and the field type defaults to bytearray. If you assign a name to a field, you can refer to that field using the name or by positional notation. If you don't assign a name to a field (the field is un-named) you can only refer to the field using positional notation. If you assign a type to a field, you can subsequently change the type using the cast operators. If you don't assign a type to a field, the field defaults to bytearray; you can change the default type using the cast operators.
  • Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in five languages: Java, Python, JavaScript, Ruby and Groovy. The most extensive support is provided for Java functions. You can customize all parts of the processing including data load/store, column transformation, and aggregation. Java functions are also more efficient because they are implemented in the same language as Pig and because additional interfaces are supported. Limited support is provided for Python, JavaScript, Ruby and Groovy functions. These functions are new, still evolving additions to the system. Currently only the basic interface is supported; load/store functions are not supported. Furthermore, JavaScript, Ruby and Groovy are provided as experimental features because they did not go through the same amount of testing as Java or Python. Note that at runtime Pig will automatically detect the usage of a scripting UDF in the Pig script and will automatically ship the corresponding scripting jar, either Jython, Rhino, JRuby or Groovy-all, to the backend. Pig also provides support for Piggy Bank, a repository for Java UDFs. Through Piggy Bank you can access Java UDFs written by other users and contribute Java UDFs that you have written.

    Eval is the most common type of function. It can be used in FOREACH statements for whatever purpose: public String exec(Tuple input). The load/store UDFs control how data goes into Pig and comes out of Pig. Often, the same function handles both input and output, but that does not have to be the case. The Pig load/store API is aligned with Hadoop's InputFormat and OutputFormat classes.

    The LoadFunc abstract class is the main class to extend for implementing a loader. The methods which need to be overridden are explained below.
    getInputFormat(): This method is called by Pig to get the InputFormat used by the loader. The methods in the InputFormat (and underlying RecordReader) are called by Pig in the same manner (and in the same context) as by Hadoop in a MapReduce Java program. If the InputFormat is a Hadoop packaged one, the implementation should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom InputFormat, it should be implemented using the new API in org.apache.hadoop.mapreduce. If a custom loader using a text-based or file-based InputFormat would like to read files in all subdirectories under a given input directory recursively, then it should use the PigTextInputFormat and PigFileInputFormat classes provided in org.apache.pig.backend.hadoop.executionengine.mapReduceLayer. The Pig InputFormat classes work around a current limitation in the Hadoop TextInputFormat and FileInputFormat classes, which only read one level down from the provided input directory. For example, if the input in the load statement is 'dir1' and there are subdirs 'dir2' and 'dir2/dir3' beneath dir1, the Hadoop TextInputFormat and FileInputFormat classes read the files under 'dir1' only. Using PigTextInputFormat or PigFileInputFormat (or by extending them), the files in all the directories can be read.
    setLocation(): This method is called by Pig to communicate the load location to the loader. The loader should use this method to communicate the same information to the underlying InputFormat. This method is called multiple times by Pig; implementations should bear this in mind and should ensure there are no inconsistent side effects due to the multiple calls.
    prepareToRead(): Through this method the RecordReader associated with the InputFormat provided by the LoadFunc is passed to the LoadFunc. The RecordReader can then be used by the implementation in getNext() to return a tuple representing a record of data back to Pig.
    getNext(): The meaning of getNext() has not changed; it is called by the Pig runtime to get the next tuple in the data. In this method the implementation should use the underlying RecordReader and construct the tuple to return.

    The StoreFunc abstract class has the main methods for storing data, and for most use cases it should suffice to extend it. The methods which need to be overridden in StoreFunc are explained below.
    getOutputFormat(): This method will be called by Pig to get the OutputFormat used by the storer. The methods in the OutputFormat (and underlying RecordWriter and OutputCommitter) will be called by Pig in the same manner (and in the same context) as by Hadoop in a MapReduce Java program. If the OutputFormat is a Hadoop packaged one, the implementation should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom OutputFormat, it should be implemented using the new API under org.apache.hadoop.mapreduce. The checkOutputSpecs() method of the OutputFormat will be called by Pig to check the output location up front. This method will also be called as part of the Hadoop call sequence when the job is launched, so implementations should ensure that it can be called multiple times without inconsistent side effects.
    setStoreLocation(): This method is called by Pig to communicate the store location to the storer. The storer should use this method to communicate the same information to the underlying OutputFormat. This method is called multiple times by Pig; implementations should bear this in mind and should ensure there are no inconsistent side effects due to the multiple calls.
    prepareToWrite(): In the new API, writing of the data is done through the OutputFormat provided by the StoreFunc. In prepareToWrite() the RecordWriter associated with the OutputFormat provided by the StoreFunc is passed to the StoreFunc. The RecordWriter can then be used by the implementation in putNext() to write a tuple representing a record of data in a manner expected by the RecordWriter.
    putNext(): The meaning of putNext() has not changed; it is called by the Pig runtime to write the next tuple of data. In the new API, this is the method wherein the implementation will use the underlying RecordWriter to write the Tuple out. (A bare StoreFunc skeleton follows these notes.)
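
To make the map/reduce pattern described in the Hadoop note above concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce API. It is not part of the original deck; the class names are illustrative only. The deck's own showcase (slides 27–30 in the transcript below) follows the same shape, with per-user interaction records instead of words.

// Minimal word-count sketch of the map/reduce pattern described in the notes.
// Illustrative only; not taken from the original deck.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

  // map: emit (word, 1) for every word in the input line
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce: sum all the counts emitted for the same word
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }
}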
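
The HDFS note above mentions streaming access and the write-once-read-many model. The following is a small, hedged sketch of how a client might write and then read a file through Hadoop's FileSystem API; the NameNode address and the file path are made-up examples, not values from the talk.

// Rough sketch of HDFS client access via org.apache.hadoop.fs.FileSystem.
// Not part of the original deck; the fs.defaultFS value and path are examples.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAccessSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");  // example NameNode address
    FileSystem fs = FileSystem.get(conf);

    // Write once: create a file and stream data into it.
    Path path = new Path("/blog/example.txt");          // example path
    try (FSDataOutputStream out = fs.create(path)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read many: stream the file back.
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}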
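
The UDF note above describes both LoadFunc and StoreFunc, but the deck itself only shows a LoadFunc (JsonLoader) and an EvalFunc (AverageWeight). Purely for illustration, a bare StoreFunc skeleton following those method descriptions (getOutputFormat, setStoreLocation, prepareToWrite, putNext) might look roughly like this; the class name TsvStorer and the tab-separated output are assumptions, not taken from the slides.

// Minimal StoreFunc skeleton following the methods described in the notes.
// Class name and tab-separated output format are assumptions for illustration.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

public class TsvStorer extends StoreFunc {
  private RecordWriter<Text, Text> writer;

  @Override
  public OutputFormat getOutputFormat() throws IOException {
    // Pig drives this OutputFormat the same way Hadoop would in a MapReduce job.
    return new TextOutputFormat<Text, Text>();
  }

  @Override
  public void setStoreLocation(String location, Job job) throws IOException {
    // Pass the location from the STORE statement on to the OutputFormat.
    FileOutputFormat.setOutputPath(job, new Path(location));
  }

  @SuppressWarnings("unchecked")
  @Override
  public void prepareToWrite(RecordWriter writer) throws IOException {
    this.writer = writer;
  }

  @Override
  public void putNext(Tuple tuple) throws IOException {
    // Join the tuple's fields with tabs and hand one line to the RecordWriter.
    StringBuilder line = new StringBuilder();
    for (int i = 0; i < tuple.size(); i++) {
      if (i > 0) {
        line.append('\t');
      }
      Object field = tuple.get(i);
      line.append(field == null ? "" : field.toString());
    }
    try {
      writer.write(null, new Text(line.toString()));
    } catch (InterruptedException e) {
      throw new IOException(e);
    }
  }
}
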
Transcript

    1. Dušan Zamurović @codecentricRS
    2. ▪ Name: Dušan Zamurović ▪ Where do I come from? ◦ codecentric Novi Sad ▪ What do I do? ◦ Java web-app background ◦ ♥ JavaScript ♥ (Ajax with DWR lib) ◦ Android ◦ currently Big Data (reporting QA)
    3. ▪ me ▪ Big Data ▪ Map/Reduce algorithm ▪ Hadoop platform ▪ Pig language ▪ Showcase ◦ Java Map/Reduce implementation ◦ Pig implementation ▪ Conclusion
    4. A revolution that will transform how we live, work, and think. ▪ 3 Vs of big data ◦ Volume ◦ Variety ◦ Velocity ▪ Everyday use-cases ◦ Beautiful ◦ Useful ◦ Funny
    5. ▪ The principal characteristic ▪ Studies report ◦ 1.2 trillion gigabytes of new data was created worldwide in 2011 alone ◦ From 2005 to 2020, the digital universe will grow by a factor of 300 ◦ By 2020 the digital universe will amount to 40 trillion gigabytes (more than 5,200 gigabytes for every man, woman, and child in 2020)
    6. ▪ The biggest growth – unstructured data ◦ Documents ◦ Web logs ◦ Sensor data ◦ Videos and photos ◦ Medical devices ◦ Social media ▪ >90% of this Big Data is unstructured ▪ Analytic value? ◦ 33% valuable info by 2020
    7. ▪ Generated at high speed ▪ Needs real-time processing ▪ Example I ◦ Financial world ◦ Thousands or millions of transactions ▪ Example II ◦ Retail ◦ Analyze click streams to offer recommendations
    8. Value of Big Data is potentially great but can be released only with the right combination of people, processes and technologies. ▪ …unlock significant value by making information transparent and usable at much higher frequency
    9. ▪ Measuring heartbeat of a city - Rio de Janeiro ▪ More examples ◦ Product development – most valuable features ◦ Manufacturing – indicators of quality problems ◦ Distribution – optimize inventory and supply chains ◦ Sales – account targeting, resource allocation ▪ Beer and diapers ▪ Possible issues? ◦ Privacy, security, intellectual property, liability…
    10. "Map/Reduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key." - research publication http://research.google.com/archive/mapreduce.html
    11. ▪ In the beginning, there was Nutch ▪ Which problems does it address? ◦ Big Data ◦ Not fit for RDBMS ◦ Computationally extensive ▪ Hadoop && RDBMS ◦ “Get data to process” or “send code where data is” ◦ Designed to run on a large number of machines ◦ Separate storage
    12. ▪ Distributed File System ◦ Designed for commodity hardware ◦ Highly fault-tolerant ◦ Relaxed POSIX (to enable streaming access to file system data) ▪ Assumptions and Goals ◦ Hardware failure ◦ Streaming data access ◦ Large data sets ◦ Write-once-read-many ◦ Move computation, not data
    13. ▪ NameNode ◦ Master server, central component ◦ HDFS cluster has a single NameNode ◦ Manages client’s access ◦ Keeps track of where data is kept ◦ Single point of failure ▪ Secondary NameNode ◦ Optional component ◦ Checkpoints of the namespace (does not provide any real redundancy)
    14. ▪ DataNode ◦ Stores data in the file system ◦ Talks to NameNode and responds to requests ◦ Talks to other DataNodes (data replication) ▪ TaskTracker ◦ Should be where DataNode is ◦ Accepts tasks (Map, Reduce, Shuffle…) ◦ Set of slots for tasks
    15. ▪ JobTracker ◦ Farms tasks to specific nodes in the cluster ◦ Point of failure for MapReduce ▪ How it goes? 1. Client submits jobs → JobTracker 2. JobTracker, where is the data? → NameNode 3. JobTracker locates TaskTracker 4. JobTracker, tasks → TaskTracker 5. TaskTracker runs the tasks in its slots ◦ Job failed: TaskTracker informs, JobTracker decides ◦ Job done: JobTracker updates status 6. Client can poll JobTracker for information (a hypothetical driver sketch that submits the showcase job this way appears after this transcript)
    16. ▪ Platform for analyzing large data sets ◦ Language – Pig Latin ◦ High level approach ◦ Compiler ◦ Grunt shell ▪ Pig compared to SQL ◦ Lazy evaluation ◦ Procedural language ◦ More like an execution plan
    17. ▪ Pig Latin statements ◦ A relation is a bag ◦ A bag is a collection of tuples ◦ A tuple is an ordered set of fields ◦ A field is a piece of data ◦ A relation is referenced by name, i.e. alias
        A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
        DUMP A;
        (John,18,4.0F)
        (Mary,19,3.8F)
        (Bill,20,3.9F)
        (Joe,18,3.8F)
    18. ▪ Data types ◦ Simple: int – signed 32-bit integer; long – signed 64-bit integer; float – 32-bit floating point; double – 64-bit floating point; chararray – UTF-8 string; bytearray – blob; boolean – since Pig 0.10; datetime ◦ Complex: tuple – an ordered set of fields, e.g. (21,32); bag – a collection of tuples, e.g. {(21,32),(32,43)}; map – a set of key-value pairs, e.g. [pig#latin]
    19. ▪ Data structure and defining schemas ◦ Why to define schema? ◦ Where to define schema? ◦ How to define schema?
        /* data types not specified */
        a = LOAD '1.txt' AS (a0, b0);
        a: {a0: bytearray,b0: bytearray}
        /* number of fields not known */
        a = LOAD '1.txt';
        a: Schema for a unknown
    20. ▪ Arithmetic: +, -, *, /, %, ? : ▪ Boolean: AND, OR, NOT ▪ Cast ▪ Comparison: ==, !=, <, >, <=, >=, matches ▪ Type construction: (), {}, [] incl. eq. functions ▪ Relational ◦ GROUP ◦ DEFINE ◦ FILTER ◦ FOREACH ◦ JOIN ◦ UNION ◦ STORE ◦ LOAD ◦ SPLIT
    21. ▪ Eval functions ◦ AVG, MAX, MIN, COUNT, SUM, … ▪ Load/Store functions ◦ BinStorage ◦ JsonLoader, JsonStorage ◦ PigStorage ▪ Math functions ◦ ABS, COS, …, EXP, RANDOM, ROUND, … ▪ String functions ◦ TRIM, LOWER, SUBSTRING, REPLACE, … ▪ Datetime functions ◦ *Between, Get*, … ▪ Tuple, Bag, Map functions ◦ TOTUPLE, TOBAG, TOMAP
    22. ▪ User Defined Functions ◦ Java, Python, JavaScript, Ruby, Groovy ▪ How to write a UDF? ◦ Eval function extends EvalFunc<something> ◦ Load function extends LoadFunc ◦ Store function extends StoreFunc ▪ How to use a UDF? ◦ Register ◦ Define the name of the UDF if you like ◦ Call it
    23. ▪ Imaginary social network ▪ A lot of users… ◦ … with their friends, girlfriends, boyfriends, wives, husbands, mistresses, etc. ▪ New relationship arises… ◦ … but the new friend is not shown in the news feed ▪ Where are his/her activities? ◦ Hidden, marked as not important
    24. ▪ Find out the value of the relationship ▪ Monitor and log user activities ◦ For each user, of course ◦ Each activity has some value (event weight) ◦ Record user’s activities ◦ Store those logs in HDFS ◦ Analyze those logs from time to time ◦ Calculate needed values ◦ Show only the activities of “important” friends
    25. ▪ Events recorded in JSON format
        {
          "timestamp": 1341161607860,
          "sourceUser": "marry.lee",
          "targetUser": "ruby.blue",
          "eventName": "VIEW_PHOTO",
          "eventWeight": 1
        }
    26. public enum EventType {
          VIEW_DETAILS(3),
          VIEW_PROFILE(10),
          VIEW_PHOTO(1),
          COMMENT(2),
          COMMENT_LIKE(1),
          WALL_POST(3),
          MESSAGE(1);
          …
        }
    27. static public class InteractionMap extends
            Mapper<LongWritable, Text, Text, InteractionWritable> {
          @Override
          protected void map(LongWritable offset, Text text, Context context) … {
            …
          }
          @Override
          protected void reduce(Text token, Iterable<InteractionWritable> interactions,
              Context context) … {
            …
          }
    28. void map(LongWritable offset, Text text, Context context) {
          String[] tokens = MyJsonParser.parse(text);
          String sourceUser = tokens[1];
          String targetUser = tokens[2];
          int eventWeight = Integer.parseInt(tokens[4]);
          context.write(new Text(sourceUser),
              new InteractionWritable(targetUser, eventWeight));
        }
    29. void reduce(Text token, Iterable<InteractionWritable> iActions, Context context) … {
          Map<Text, InteractionValuesWritable> iActionsGroup =
              new HashMap<Text, InteractionValuesWritable>();
          Iterator<InteractionWritable> iActionsIterator = iActions.iterator();
          while (iActionsIterator.hasNext()) {
            InteractionWritable iAction = iActionsIterator.next();
            Text targetUser = new Text(iAction.getTargetUser().toString());
            int weight = iAction.getEventWeight().get();
            int count = 1;
            …
    30. …
            InteractionValuesWritable iActionValues = iActionsGroup.get(targetUser);
            if (iActionValues != null) {
              weight += iActionValues.getWeight().get();
              count = iActionValues.getCount().get() + 1;
            }
            iActionsGroup.put(targetUser, new InteractionValuesWritable(weight, count));
          }
          List orderedInteractions = sortInteractionsByWeight(iActionsGroup);
          for (Entry entry : orderedInteractions) {
            InteractionValuesWritable value = entry.getValue();
            String resLine = … // entry.key + value.weight + value.count
            context.write(token, new Text(resLine));
          }
        }
    31. casie.keller petar.petrovic 97579 32554
        casie.keller marry.lee 97284 32094
        casie.keller jane.doe 97247 32400
        casie.keller domenico.quatro-formaggi 96712 32106
        casie.keller esmeralda.aguero 96665 32251
        casie.keller jason.bourne 96499 32043
        casie.keller jose.miguel 96304 31927
        casie.keller steve.smith 95929 32267
        casie.keller john.doe 95664 31996
        casie.keller swatka.mawa 95421 31785
        casie.keller lee.young 95400 31758
        casie.keller ruby.blue 95132 32181
        domenico.quatro-formaggi jane.doe 97442 32492
        domenico.quatro-formaggi ruby.blue 97072 31916
        domenico.quatro-formaggi jason.bourne 96967 3223
        …
    32. class JsonLoader extends LoadFunc {
          @Override
          public InputFormat getInputFormat() throws IOException {
            return new TextInputFormat();
          }
          public ResourceSchema getSchema(String location, Job job) … {
            ResourceSchema schema = new ResourceSchema();
            ResourceFieldSchema[] fieldSchemas = new ResourceFieldSchema[SCHEMA_FIELDS_COUNT];
            fieldSchemas[0] = new ResourceFieldSchema();
            fieldSchemas[0].setName(FIELD_NAME_TIMESTAMP);
            fieldSchemas[0].setType(DataType.LONG);
            …
            schema.setFields(fieldSchemas);
            return schema;
          }
        }
    33. class JsonLoader extends LoadFunc {
          …
          @Override
          public Tuple getNext() throws IOException {
            try {
              boolean notDone = in.nextKeyValue();
              if (!notDone) {
                return null;
              }
              Text jsonRecord = (Text) in.getCurrentValue();
              String[] values = MyJsonParser.parse(jsonRecord);
              Tuple tuple = tupleFactory.newTuple(Arrays.asList(values));
              return tuple;
            } catch (Exception exc) {
              throw new IOException(exc);
            }
          }
        }
    34. class AverageWeight extends EvalFunc<String> {
          …
          @Override
          public String exec(Tuple input) … {
            String output = null;
            if (input != null && input.size() == 2) {
              Integer totalWeight = (Integer) input.get(0);
              Integer totalCount = (Integer) input.get(1);
              BigDecimal average = new BigDecimal(totalWeight)
                  .divide(new BigDecimal(totalCount), SCALE, RoundingMode.HALF_UP);
              output = average.stripTrailingZeros().toPlainString();
            }
            return output;
          }
        }
    35. REGISTER codingserbia-udf.jar
        DEFINE AVG_WEIGHT com.codingserbia.udf.AverageWeight();
        interactionRecords = LOAD '/blog/user_interaction_big.json'
            USING com.codingserbia.udf.JsonLoader();
        interactionData = FOREACH interactionRecords GENERATE
            sourceUser,
            targetUser,
            eventWeight;
        groupInteraction = GROUP interactionData BY (sourceUser, targetUser);
        …
    36. …
        summarizedInteraction = FOREACH groupInteraction GENERATE
            group.sourceUser AS sourceUser,
            group.targetUser AS targetUser,
            SUM(interactionData.eventWeight) AS eventWeight,
            COUNT(interactionData.eventWeight) AS eventCount,
            AVG_WEIGHT(
                SUM(interactionData.eventWeight),
                COUNT(interactionData.eventWeight)) AS averageWeight;
        result = ORDER summarizedInteraction BY sourceUser, eventWeight DESC;
        STORE result INTO '/results/pig_mr' USING PigStorage();
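
The transcript shows the showcase Mapper and Reducer (slides 27–30) and the JobTracker submission flow (slide 15), but no driver class. A hypothetical driver that wires the showcase classes into a Job and submits it might look roughly like this; the class name InteractionJob, the assumed reducer name InteractionReduce and the output path are not in the original slides. Such a driver would typically be packaged into a jar and launched with the hadoop jar command.

// Hypothetical driver for the interaction-weight job from the showcase.
// The original deck does not include a driver; InteractionJob, InteractionReduce
// and the output path are assumptions for illustration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InteractionJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "interaction-weights");
    job.setJarByClass(InteractionJob.class);

    // Mapper emits (sourceUser, InteractionWritable); the reducer emits (sourceUser, summary line).
    job.setMapperClass(InteractionMap.class);       // mapper from slide 27
    job.setReducerClass(InteractionReduce.class);   // assumed name; the reducer class is not named in the slides
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(InteractionWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // JSON event logs in, per-user interaction summaries out
    // (input path taken from the Pig script in slide 35; output path is an example).
    FileInputFormat.addInputPath(job, new Path("/blog/user_interaction_big.json"));
    FileOutputFormat.setOutputPath(job, new Path("/results/java_mr"));

    // Submitting the job hands it to the JobTracker, which schedules map and
    // reduce tasks on TaskTrackers close to the data, as described in slide 15.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}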