This document provides an overview of Apache Hadoop and its components. It discusses what big data is and how Hadoop uses MapReduce and HDFS to process large datasets across clusters. Example use cases are presented, including logging massive amounts of data from devices. Hadoop installations and configurations are covered. The document also demonstrates how to use Pig Latin to analyze Hadoop data, with examples of common Pig statements like LOAD, FILTER, and STORE.
3. What is Big Data?
"Big data" is a technology term for data that has grown too large to be managed by the approaches previously known to work.
4. Apache Hadoop
Hadoop is an open source, top-level Apache project based on Google's MapReduce whitepaper. It is a popular project used by several large companies to process massive amounts of data.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
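To make that "simple programming model" concrete, here is a minimal single-process Python sketch of the MapReduce idea on the classic word-count problem. The three phases (map, shuffle by key, reduce) are exactly what Hadoop runs, except Hadoop distributes them across the cluster; everything else here is an illustrative simplification.

```python
from collections import defaultdict

def map_phase(documents):
    # map: emit a (word, 1) pair for every word in every input record
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # shuffle: group all values by key, as Hadoop does between map and reduce
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # reduce: sum the counts collected for each word
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big clusters", "big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1}
```

In real Hadoop, each map task processes one HDFS block and the shuffle moves data over the network, but the programmer only writes the map and reduce functions.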
5. Hadoop
Is designed to scale!
Uses commodity hardware and can be scaled out using several boxes
Processes data in batches
Can process petabyte-scale data
6. Before you consider Hadoop, define your use case
Before using Hadoop or any other "big data" or "NoSQL" solution, consider defining your use case well.
Regular databases can solve most common use cases if scaled well. For example, MySQL can handle hundreds of millions of records efficiently if scaled well.
7. Hadoop is best suited when
You have massive data - terabytes of data generated (or being generated every day) that you would like to process for several group queries, in a scalable manner.
You have billions of rows of data that need to be dissected for several different reports.
8. Hadoop, Components and Related Projects
HDFS - Hadoop Distributed File System
MapReduce - Hadoop distributed execution framework
Apache Pig - functional dataflow language for Hadoop M/R (Yahoo)
Hive - SQL-like language for Hadoop M/R (Facebook)
ZooKeeper
Etc.
9. Our use case and previous solution
Our use case: store and process 100s of millions of records a day, and run 100s of queries - summing and aggregating data.
Along the line, we had evaluated different NoSQL technologies for various use cases (not necessarily for the one above). MongoDB, Cassandra, Memcached, etc. were implemented for various use cases.
11. Issues faced
1. Not a known / preferred design approach
2. Scalability issues that we had to address ourselves
3. Needed a solution that can handle billions of records per day instead of 100s of millions
4. Needed a truly proven, scalable solution
12. Why Hadoop
A proven, open source, highly reliable, distributed data processing platform.
Met our use case of processing 100s of millions of logs perfectly.
We tuned the deployment to process all data with a maximum of 30 minutes latency.
14. Installation
Requirements:
Linux
Hadoop 0.20.203.X (current stable version)
Java 1.6.x
Installed using RPM
Configuration: single node or multiple nodes
15. Installation
Modes:
Local (standalone) - running on a single node with a single Java process
Pseudo-distributed - running on a single node with separate Java processes
Fully distributed - running on a cluster of nodes with separate Java processes
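For the pseudo-distributed mode above, a classic Hadoop 0.20-era configuration puts the filesystem and JobTracker addresses into the XML files under conf/. The sketch below is illustrative, not a complete setup; the endpoint values match the hdfs://localhost:8020 and localhost:9001 addresses that appear in the Pig Grunt shell output later in this deck.

```xml
<!-- conf/core-site.xml: where the HDFS namenode listens -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: one copy of each block is enough on a single node -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: where the JobTracker listens -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```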
19. Sample Physical Architecture
[Diagram: stream agents on app nodes 1..N feed a Collector; the Collector writes into the Hadoop cluster (Namenode + Datanodes); a Glue + Pig node and a ZooKeeper node (VMs) sit alongside, with a DB for results.]
20. Sample Logical Architecture
[Diagram: stream agents on app nodes 1..N send logs to the Collector, which writes to the Datanodes; the Namenode, Glue, Pig, ZooKeeper, and a DB complete the cluster.]
21. Implementation of a Hadoop System
BigStreams - logging framework
Streams is a high-availability, extremely fast, low-resource-usage, real-time log collection framework for terabytes of data.
- Key author is Gerrit, our architect
http://code.google.com/p/bigstreams/
Google Protobuf
http://code.google.com/p/protobuf/
Used, together with LZO compression, on the data before transferring it from the application nodes to the Hadoop nodes.
22. Implementation
Data logs are compressed by the stream agents and sent to the Collector.
The Collector informs the Namenode of new file arrivals.
The Namenode replies with the file sizes, how many blocks are needed, and which Datanodes each block will be stored on.
The Collector then sends the blocks of the file directly to the Datanodes.
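A toy, single-process Python sketch of the namenode's side of the write path described above. The 64 MB block size was the HDFS default in this era; the round-robin placement is a deliberate simplification (real HDFS placement is replication- and rack-aware).

```python
BLOCK_SIZE = 64 * 1024 * 1024  # HDFS default block size of this era (64 MB)

def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE):
    """Toy 'namenode': split a file into blocks and assign each to a datanode."""
    num_blocks = (file_size + block_size - 1) // block_size  # ceiling division
    # round-robin placement - a simplification of the real HDFS placement policy
    return [(i, datanodes[i % len(datanodes)]) for i in range(num_blocks)]

# A 200 MB file needs 4 blocks, spread over three datanodes:
plan = plan_blocks(200 * 1024 * 1024, ["dn1", "dn2", "dn3"])
print(plan)  # [(0, 'dn1'), (1, 'dn2'), (2, 'dn3'), (3, 'dn1')]
```

The client (here, the Collector) then streams each block directly to its assigned datanode, so the namenode never sits in the data path.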
23. Data Processing with Pig
Once data is saved in the HDFS cluster it can be processed using Java programs or by using Apache Pig: http://pig.apache.org
1. Apache Pig is a platform for analyzing large data sets
2. Pig Latin is its language, which presents a simplified manner to run queries
3. The Pig platform has a compiler which translates Pig queries to MapReduce programs for Hadoop
25. Requirements
Unix and Windows users need the following:
1. Hadoop 0.20.2 - http://hadoop.apache.org/common/releases.html
2. Java 1.6 - http://java.sun.com/javase/downloads/index.jsp (set JAVA_HOME to the root of your Java installation)
3. Ant 1.7 - http://ant.apache.org/ (optional, for builds)
4. JUnit 4.5 - http://junit.sourceforge.net/ (optional, for unit tests)
Windows users need to install Cygwin and the Perl package: http://www.cygwin.com/
Download link: http://www.gtlib.gatech.edu/pub/apache//pig/pig-0.8.1/
26. Pig Installing Commands
To install Pig on Red Hat systems:
$ rpm -ivh --nodeps pig-0.8.0-1x86_64
To start the Grunt shell:
$ pig
0 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to Hadoop file system at: hdfs://localhost:8020
352 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
grunt>
27. Pig Statements
LOAD
Loads data from the file system.
Usage
Use the LOAD operator to load data from the file system.
Examples
Suppose we have a data file called myfile.txt. The fields are tab-delimited. The records are newline-separated.
1	2	3
4	2	1
8	3	4
In this example the default load function, PigStorage, loads data from myfile.txt to form relation A. The two LOAD statements are equivalent. Note that, because no schema is specified, the fields are not named and all fields default to type bytearray.
A = LOAD 'myfile.txt';
A = LOAD 'myfile.txt' USING PigStorage('\t');
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
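As a rough Python analogue of what PigStorage does with each record (a sketch only): split the line on the delimiter and keep the fields untyped, mirroring the bytearray default when no schema is given.

```python
def pig_storage_load(lines, delimiter="\t"):
    """Toy PigStorage('\t'): split each record into fields, left as raw strings."""
    return [tuple(line.rstrip("\n").split(delimiter)) for line in lines]

myfile = ["1\t2\t3\n", "4\t2\t1\n", "8\t3\t4\n"]
A = pig_storage_load(myfile)
print(A)  # [('1', '2', '3'), ('4', '2', '1'), ('8', '3', '4')]
```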
28. Pig Statements
FOREACH
Generates data transformations based on columns of data.
Syntax: alias = FOREACH { gen_blk | nested_gen_blk };
Usage
Use the FOREACH...GENERATE operation to work with columns of data (if you want to work with tuples or rows of data, use the FILTER operation).
FOREACH...GENERATE works with relations (outer bags) as well as inner bags:
● If A is a relation (outer bag), a FOREACH statement could look like this:
X = FOREACH A GENERATE f1;
● If A is an inner bag, a FOREACH statement could look like this:
X = FOREACH B {
    S = FILTER A BY 'xyz';
    GENERATE COUNT(S.$0);
}
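In Python terms, the outer-bag form of FOREACH...GENERATE is simply a per-tuple projection (a sketch with a hypothetical relation A of fields f1, f2, f3):

```python
# Hypothetical relation A with fields (f1, f2, f3)
A = [("r1", 1, 2), ("r2", 2, 1)]

# X = FOREACH A GENERATE f1;  -> keep only the first column of every tuple
X = [(f1,) for (f1, f2, f3) in A]
print(X)  # [('r1',), ('r2',)]
```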
29. Pig Statements
GROUP
Groups the data in one or more relations.
Syntax
alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression ...] [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
Usage
The GROUP operator groups together tuples that have the same group key (key field). The key field will be a tuple if the group key has more than one field, otherwise it will be the same type as that of the group key. The result of a GROUP operation is a relation that includes one tuple per group. This tuple contains two fields:
● The first field is named "group" (do not confuse this with the GROUP operator) and is the same type as the group key.
● The second field takes the name of the original relation and is type bag.
● The names of both fields are generated by the system as shown in the example below.
Note the following about the GROUP/COGROUP and JOIN operators:
● The GROUP and JOIN operators perform similar functions. GROUP creates a nested set of output tuples while JOIN creates a flat set of output tuples.
30. Pig Statements
Example For GROUP
Suppose we have relation A.
A = LOAD 'data' AS (f1:chararray, f2:int, f3:int);
DUMP A;
(r1,1,2)
(r2,2,1)
(r3,2,8)
(r4,4,4)
In this example the tuples are grouped using an expression, f2*f3.
X = GROUP A BY f2*f3;
DUMP X;
(2,{(r1,1,2),(r2,2,1)})
(16,{(r3,2,8),(r4,4,4)})
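The GROUP semantics can be mimicked in plain Python (a sketch only): each output tuple pairs a key with the bag of all input tuples sharing it, mirroring the (group, bag) layout shown in DUMP X.

```python
from collections import defaultdict

# Relation A from the slide: (f1:chararray, f2:int, f3:int)
A = [("r1", 1, 2), ("r2", 2, 1), ("r3", 2, 8), ("r4", 4, 4)]

def group_by(relation, key_fn):
    """Mimic Pig's GROUP: one output tuple per key; the second field is the bag."""
    bags = defaultdict(list)
    for t in relation:
        bags[key_fn(t)].append(t)
    return sorted(bags.items())

# X = GROUP A BY f2*f3;
X = group_by(A, lambda t: t[1] * t[2])
print(X)  # [(2, [('r1', 1, 2), ('r2', 2, 1)]), (16, [('r3', 2, 8), ('r4', 4, 4)])]
```

Note how r1 and r2 both evaluate f2*f3 to 2, and r3 and r4 to 16, reproducing the two groups in the slide's output.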
31. Pig Statements
FILTER
Selects tuples from a relation based on some condition.
Syntax: alias = FILTER alias BY expression;
Usage
Use the FILTER operator to work with tuples or rows of data (if you want to work with columns of data, use the FOREACH...GENERATE operation).
FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data you don't want.
X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));
DUMP X;
(4,2,1)
(8,3,4)
(7,2,5)
(8,4,3)
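The predicate can be checked in Python, assuming relation A holds the same six tuples used in the STORE example later in the deck; the comprehension reproduces the DUMP X output above.

```python
# Relation A as in the STORE example slide: six tuples of three ints
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

def keep(t):
    # Pig predicate: (f1 == 8) OR (NOT (f2 + f3 > f1))
    f1, f2, f3 = t
    return f1 == 8 or not (f2 + f3 > f1)

X = [t for t in A if keep(t)]
print(X)  # [(4, 2, 1), (8, 3, 4), (7, 2, 5), (8, 4, 3)]
```

For instance (1,2,3) is dropped because f1 is not 8 and 2+3 > 1, while (7,2,5) is kept because 2+5 is not greater than 7.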
32. Pig Statements
STORE
Stores or saves results to the file system.
Syntax
STORE alias INTO 'directory' [USING function];
Usage
Use the STORE operator to run (execute) Pig Latin statements and save (persist) results to the file system. Use STORE for production scripts and batch mode processing.
Note: To debug scripts during development, you can use DUMP to check intermediate results.
33. Pig Statements
Examples For STORE
In this example data is stored using PigStorage and the asterisk character (*) as the field delimiter.
A = LOAD 'data' AS (a1:int, a2:int, a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
STORE A INTO 'myoutput' USING PigStorage('*');
CAT myoutput;
1*2*3
4*2*1
8*3*4
4*3*3
7*2*5
8*4*3
34. Pig Latin Example
ads = LOAD '/log/raw/old/ads/month=09/day=01,/log/raw/old/clicks/month=09/day=01'
    using com.twitter.elephantbird.pig.proto.LzoProtobuffB64LinePigStore('ad_data');
r1 = foreach ads generate
    ad.id as id,
    device.carrier_name as carrier_name,
    device.device_name as device_name,
    device.mobile_model as mobile_model,
    ipinfo.city as city_code,
    ipinfo.country as country,
    ipinfo.ipaddress as ipaddress,
    ipinfo.region_code as wifi_Gprs,
    site.client_id as client_id,
    ad.metrics as metrics,
    impressions,
    clicks;
g = group r1 by (id, carrier_name, device_name, mobile_model, city_code, country, client_id, wifi_Gprs, metrics);
r = foreach g generate FLATTEN(group), SUM($1.impressions) as imp, SUM($1.clicks) as cl;
rmf /tmp/predicttesnul;
store r into '/tmp/predicttesnul' using PigStorage('\t');
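The group-and-sum at the heart of this script can be sketched in Python. The records below are hypothetical miniatures with only two key fields (the real script groups on nine), but the shape of the result - flattened group key followed by the two sums - matches the FLATTEN(group), SUM(...) line above.

```python
from collections import defaultdict

# Hypothetical miniature ad-log rows: (carrier, country, impressions, clicks)
rows = [
    ("carrierA", "IN", 10, 1),
    ("carrierA", "IN", 5, 0),
    ("carrierB", "US", 7, 2),
]

# g = group rows by (carrier, country); then sum impressions and clicks per group
totals = defaultdict(lambda: [0, 0])
for carrier, country, imp, cl in rows:
    totals[(carrier, country)][0] += imp
    totals[(carrier, country)][1] += cl

# FLATTEN(group) means the key fields come back as plain columns in the output
r = sorted(k + (imp, cl) for k, (imp, cl) in totals.items())
print(r)  # [('carrierA', 'IN', 15, 1), ('carrierB', 'US', 7, 2)]
```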
35. Demos / Contact us
Some hands-on demos follow.
Need more information? Find us on Twitter: twitter.com/azifali