This document provides an introduction to Pig, a platform for analyzing large datasets. It discusses how Pig works with Hadoop and HDFS to allow for distributed processing and storage of big data. Pig allows users to write scripts using a simple data flow language to analyze large datasets without needing to write MapReduce programs directly. This improves programmer productivity and makes big data analysis accessible to more users without Java expertise.
1. PIG in Big Data
Data keeps growing…
2. BIG DATA
• ‘Big Data’ is similar to ‘small data’, but bigger in size
• It requires different approaches: techniques, tools, and architecture
• The goal is to solve new problems, or old problems in a better way
• It means the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques
3. INTRODUCTION TO BIG DATA
• Data volume grows with the data source (figure from http://datameer.com):
• ERP (Megabytes): purchase details, purchase records, payment records
• CRM (Gigabytes): segmentation, offer details, customer touches, support contacts
• WEB (Terabytes): weblogs, offer history, A/B testing, dynamic pricing, affiliate network, search marketing, behavioral targeting, dynamic funnels
• … and far far beyond (Petabytes): user generated content, mobile web, user click stream, sentiment, social network, external demographics, business data feeds, HD video, speech to text, product/service logs, SMS/MMS
4. CONT.,
• Walmart handles more than 1 million customer transactions every hour
• Facebook handles 40 billion photos from its user base
• Decoding the human genome originally took 10 years; now it can be achieved in one week
6. HADOOP
• As data grows, we need to be able to scale out computation
• Uses cheap(er) hardware to grow horizontally
• Tolerates a few machines going down, which happens all the time
• Stores all your data from all systems
• No need to throw data away
8. HDFS
• Hadoop Distributed File System
• A distributed, scalable, and portable file system written in Java for the Hadoop framework
• Provides high-throughput access to application data
• Runs on large clusters of commodity machines
• Used to store large datasets
9. CONT.,
• A file we want to store on HDFS … (the slide shows a 600 MB text file as an example)
10. CONT.,
• HDFS splits the file into blocks … (with a 256 MB block size, the 600 MB file becomes two 256 MB blocks plus one 88 MB block)
11. MAP REDUCE
• A distributed data processing model and execution environment that runs on large clusters of commodity machines
• Also called MR
• Programs are inherently parallel
13. PIG-INTRODUCTION
• A high-level data flow language for exploring very large datasets
• Provides an engine for executing data flows in parallel on Hadoop
• Its compiler produces sequences of MapReduce programs
• Its structure is amenable to substantial parallelization
• Operates on files in HDFS
• Metadata is not required, but is used when available
14. KEY PROPERTIES OF PIG
• Ease of programming: trivial to achieve parallel execution of simple, parallel data analysis tasks
• Optimization opportunities: allows the user to focus on semantics rather than efficiency
• Extensibility: users can create their own functions to do special-purpose processing
18. PIG VS HADOOP
• 5% of the MR code
• 5% of the MR development time
• Within 25% of the MR execution time
• Readable and reusable
• An easy-to-learn DSL
• Increases programmer productivity
• No Java expertise required
• Anyone (e.g., BI folks) can trigger the jobs
• Insulates against Hadoop complexity:
• Version upgrades
• Changes in Hadoop interfaces
• JobConf configuration tuning
• Job chains
19. PIG COMMANDS
• Load: Read data from the file system
• Store: Write data to the file system
• Dump: Write output to stdout
• Foreach: Apply an expression to each record and generate one or more records
• Filter: Apply a predicate to each record and remove records where it is false
• Group / Cogroup: Collect records with the same key from one or more inputs
• Join: Join two or more inputs based on a key
• Order: Sort records based on a key
• Distinct: Remove duplicate records
• Union: Merge two datasets
• Limit: Limit the number of records
• Split: Split data into 2 or more sets, based on filter conditions
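Taken together, these statements chain into a single data flow. The following is a minimal sketch of a complete script, assuming a hypothetical comma-separated input file 'sfdcemployees' (the same illustrative dataset used in the later slides):
employees = LOAD 'sfdcemployees' USING PigStorage(',') AS (name:chararray, employeesince:int, age:int);
veterans = FILTER employees BY employeesince < 2010; -- keep long-tenured employees
byyear = GROUP veterans BY employeesince; -- one group per start year
counts = FOREACH byyear GENERATE group AS startyear, COUNT(veterans) AS total;
sorted = ORDER counts BY total DESC;
top5 = LIMIT sorted 5;
DUMP top5; -- no job runs until this line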
20. LOADING DATA
• LOAD
• Reads data from the file system
• Syntax
• LOAD 'input' [USING function] [AS schema];
• E.g., A = LOAD 'input' USING PigStorage('\t') AS (name:chararray, age:int, gpa:float);
21. SCHEMA
• Use schemas to assign types to fields
• A = LOAD 'data' AS (name, age, gpa);
• name, age, gpa default to bytearrays
• A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
• name is now a string (chararray), age is an integer, and gpa is a float
22. DESCRIBING SCHEMA
• DESCRIBE
• Provides the schema of a relation
• Syntax
• DESCRIBE [alias];
• If a schema is not provided, DESCRIBE will say “Schema for alias unknown”
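As a quick illustration, loading with a typed schema and then describing the relation prints the declared schema in the grunt shell (the file name 'data' is illustrative):
grunt> A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
grunt> DESCRIBE A;
A: {name: chararray, age: int, gpa: float}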
23. DUMP AND STORE
• DUMP writes the output to the console
• grunt> A = LOAD 'data';
• grunt> DUMP A; -- prints the contents of A on the console
• STORE writes output to an HDFS location
• grunt> A = LOAD 'data';
• grunt> STORE A INTO '/user/username/output'; -- writes the contents of A to HDFS
• Pig starts a job only when a DUMP or STORE is encountered (lazy evaluation)
24. REFERENCING FIELDS
• Fields are referred to by positional notation or by name (alias)
• Positional notation is generated by the system and starts with $0
• Names are assigned by you using schemas
• E.g., A = LOAD 'data' AS (name:chararray, age:int);
• With positional notation, fields can be accessed as:
• A = LOAD 'data';
• B = FOREACH A GENERATE $0, $1; -- 1st & 2nd columns
25. LIMIT
• Limits the number of output tuples
• Syntax
• alias = LIMIT alias n;
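For example, to keep at most 10 tuples of a relation (the input file 'data' is illustrative):
A = LOAD 'data';
B = LIMIT A 10; -- B holds at most 10 tuples; which ones is arbitrary unless A is ordered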
26. FILTER
• Selects tuples from a relation based on some condition
• Syntax
• alias = FILTER alias BY expression;
• E.g., to filter for ‘marcbenioff’:
• A = LOAD 'sfdcemployees' USING PigStorage(',') AS (name:chararray, employeesince:int, age:int);
• B = FILTER A BY name == 'marcbenioff';
• You can use boolean operators (AND, OR, NOT)
• B = FILTER A BY (employeesince < 2005) AND (NOT(name == 'marcbenioff'));
27. GROUP BY
• Syntax:
• alias = GROUP alias { ALL | BY expression } [, alias ALL | BY expression …] [PARALLEL n];
• E.g., to group by employee start year at Salesforce:
• A = LOAD 'sfdcemployees' USING PigStorage(',') AS (name:chararray, employeesince:int, age:int);
• B = GROUP A BY employeesince;
• You can also group all records together
• B = GROUP A ALL;
• Or group by multiple fields
• B = GROUP A BY (age, employeesince);
28. AGGREGATION
• Pig provides a number of built-in aggregation functions:
• AVG
• COUNT
• COUNT_STAR
• SUM
• MAX
• MIN
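Aggregate functions are typically applied per group inside a FOREACH. A minimal sketch, reusing the hypothetical sfdcemployees file from the previous slides to compute the average age per start year:
A = LOAD 'sfdcemployees' USING PigStorage(',') AS (name:chararray, employeesince:int, age:int);
B = GROUP A BY employeesince;
C = FOREACH B GENERATE group AS startyear, AVG(A.age) AS avgage; -- AVG runs over the bag of grouped records
DUMP C;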
29. DEFINE
• Assigns an alias to a UDF
• Syntax
• DEFINE alias {function}
• Use DEFINE to specify a UDF when:
• The UDF has a long package name
• The UDF constructor takes string parameters
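A minimal sketch; the package and class name below are hypothetical, standing in for any UDF with a long name and a constructor argument:
-- com.example.pig.udfs.MyLongNamedFunction is a hypothetical UDF class
DEFINE shorten com.example.pig.udfs.MyLongNamedFunction('someParam');
A = LOAD 'data' AS (name:chararray);
B = FOREACH A GENERATE shorten(name); -- call the UDF through its short alias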