Apache Pig is a platform for analyzing large datasets that runs on top of Hadoop. It provides a high-level language called Pig Latin in which users express data analysis programs; Pig compiles these programs into sequences of MapReduce jobs for execution. Pig Latin offers operators for common data management tasks such as filtering, joining, grouping, and sorting, making the analysis of large datasets easier.
This presentation demonstrates Apache Pig: introduction, description, installation, Pig Latin commands, usage, and examples.
Tushar B. Kute
Researcher,
http://tusharkute.com
2. What is Pig?
• Apache Pig is an abstraction over MapReduce.
• It is a tool/platform used to analyze large data sets, representing them as data flows.
• Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.
• To write data analysis programs, Pig provides a high-level language known as Pig Latin.
• This language provides various operators with which programmers can develop their own functions for reading, writing, and processing data.
4. • To analyze data using Apache Pig, programmers need to write scripts in the Pig Latin language.
• All these scripts are internally converted to Map and Reduce tasks.
• Apache Pig has a component known as Pig Engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs.
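To give a feel for what such a script looks like, here is a minimal sketch of a complete Pig Latin data flow (the file path and schema are illustrative assumptions, not from the slides):

-- load a comma-separated file (hypothetical path and schema)
employees = LOAD '/home/cloudera/employees.txt' USING PigStorage(',')
    as (id:int, name:chararray, city:chararray, salary:int);
-- keep only rows that satisfy a condition
high_paid = FILTER employees BY salary > 50000;
-- group by city and count each group
by_city = GROUP high_paid BY city;
counts = FOREACH by_city GENERATE group, COUNT(high_paid);
dump counts;

When run in MapReduce mode, the Pig Engine compiles this whole flow into one or more MapReduce jobs behind the scenes.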
5. Features of Pig
• Rich set of operators: It provides many operators to perform operations like join, sort, filter, etc.
• Ease of programming: Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
• UDFs: Pig provides the facility to create User Defined Functions in other programming languages such as Java, and to invoke or embed them in Pig scripts.
• Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured and unstructured. It stores the results in HDFS.
6. Apache Pig vs Hive
• Both Apache Pig and Hive are used to create MapReduce jobs, and in some cases Hive operates on HDFS in a similar way to Apache Pig.
9. Pig Execution Modes
• You can run Apache Pig in two modes.
• Local Mode
– In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.
• MapReduce Mode
– MapReduce mode is where we load or process data that exists in the Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform the particular operation on the data in HDFS.
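The mode is selected with the -x flag when launching Pig; a minimal sketch (both flags are standard, and mapreduce is the default):

[cloudera@quickstart ~]$ pig -x local      # local mode: local file system, no Hadoop needed
[cloudera@quickstart ~]$ pig -x mapreduce  # MapReduce mode: data read from HDFS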
11. Execution Mechanisms
• Interactive Mode (Grunt shell) – You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter Pig Latin statements and get the output (using the Dump operator).
• Batch Mode (Script) – You can run Apache Pig in batch mode by writing the Pig Latin script in a single file with the .pig extension.
• Embedded Mode (UDF) – Apache Pig provides the provision of defining our own functions (User Defined Functions) in programming languages such as Java, and using them in our script.
12. • Interactive Mode:
grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',');
grunt> dump customers;
• Batch Mode (Local):
[cloudera@quickstart ~]$ cat pig_samplescript_local.pig
customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,address:chararray,salary:int);
dump customers;
[cloudera@quickstart ~]$ pig -x local pig_samplescript_local.pig
14. Diagnostic Operators
• The load statement simply loads the data into the specified relation in Apache Pig. To verify the execution of the Load statement, you have to use the diagnostic operators.
• Pig Latin provides four different types of diagnostic operators:
– Dump operator
– Describe operator
– Explain operator
– Illustrate operator
15. • Dump operator
• The Dump operator is used to run Pig Latin statements and display the results on the screen. It is generally used for debugging purposes.
grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,address:chararray,salary:int);
grunt> dump customers;
16. • Describe operator
• The describe operator is used to view the schema of a relation/bag.
grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,address:chararray,salary:int);
grunt> describe customers;
customers: {id: int,name: chararray,age: int,address: chararray,salary: int}
17. • Explain operator
• The explain operator is used to display the logical, physical, and MapReduce execution plans of a relation/bag.
grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,address:chararray,salary:int);
grunt> explain customers;
18. • Illustrate operator
• The illustrate operator gives a step-by-step execution of a sequence of statements, showing sample results for each step. It is useful for checking a script against a small sample of the data.
grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,address:chararray,salary:int);
grunt> illustrate customers;
20. Group Operator
• The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.
• grunt> student_details = LOAD '/home/cloudera/students.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
• grunt> student_groupdata = GROUP student_details by age;
24. Join Operator
• The JOIN operator is used to combine records from two or more relations.
• Types of joins:
– Self-join
– Inner-join
– Outer-join: left join, right join, full join
25. Self Join
• To perform a self-join in Pig, the same data is loaded under two different aliases, which are then joined on the key.
• customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int);
• orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int);
• grunt> customers1 = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int);
• grunt> customers2 = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int);
• grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
• grunt> Dump customers3;
26. Inner Join (equijoin)
• grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int);
• grunt> customer_orders = JOIN customers BY id, orders BY customer_id;
• grunt> dump customer_orders;
27. Left Outer Join
• The left outer join operation returns all rows from the left relation, even if there are no matches in the right relation.
• grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int);
• grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
• grunt> Dump outer_left;
28. Right Outer Join
• The right outer join operation returns all rows from the right relation, even if there are no matches in the left relation.
• grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int);
• grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
• grunt> Dump outer_right;
29. Full Outer Join
• The full outer join operation returns rows when there is a match in either relation.
• grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int);
• grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
• grunt> Dump outer_full;
32. Union Operator
• The UNION operator of Pig Latin is used to merge the contents of two relations. To perform a UNION operation on two relations, their columns and domains must be identical.
33. • grunt> student1 = LOAD '/home/cloudera/student_data1.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
• grunt> student2 = LOAD '/home/cloudera/student_data2.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
• grunt> student = UNION student1, student2;
• grunt> dump student;
35. • grunt> student_details = LOAD '/home/cloudera/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
• Let us now split the relation into two: one listing the students whose age is less than 23, and the other listing the students aged between 23 and 25.
• SPLIT student_details into student_details1 if age<23, student_details2 if (age>23 and age<25);
• grunt> Dump student_details1;
• grunt> Dump student_details2;
37. Filter Operator
• The FILTER operator is used to select the required tuples from a relation based on a condition.
• grunt> student_details = LOAD '/home/cloudera/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
• grunt> filter_data = FILTER student_details BY city == 'Chennai';
• grunt> dump filter_data;
38. Distinct Operator
• The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation.
• grunt> student_details = LOAD '/home/cloudera/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
• grunt> distinct_data = DISTINCT student_details;
• grunt> dump distinct_data;
39. Foreach Operator
• The FOREACH operator is used to generate specified data transformations based on the column data.
grunt> student_details = LOAD '/home/cloudera/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
• Get the id, age, and city values of each student from the relation student_details and store them in another relation named foreach_data using the foreach operator, as sketched below.
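The statement itself is not shown on this slide; a minimal sketch matching the description above:

grunt> foreach_data = FOREACH student_details GENERATE id, age, city;
grunt> dump foreach_data;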
41. Order By
• The ORDER BY operator is used to display the contents of a relation sorted on one or more fields.
• grunt> student_details = LOAD '/home/cloudera/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
• grunt> order_by_data = ORDER student_details BY age DESC;
45. AVG()
• Computes the average of the numeric values in a single-column bag.
• grunt> A = LOAD '/home/cloudera/student.txt' USING PigStorage(',') as (name:chararray, term:chararray, gpa:float);
• grunt> DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
• grunt> B = GROUP A BY name;
• grunt> DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
• grunt> C = FOREACH B GENERATE A.name, AVG(A.gpa);
• grunt> DUMP C;
({(John),(John),(John),(John)},3.850000023841858)
({(Mary),(Mary),(Mary),(Mary)},3.925000011920929)
46. CONCAT()
• Concatenates two expressions of identical type.
• grunt> A = LOAD '/home/cloudera/data.txt' as (f1:chararray, f2:chararray, f3:chararray);
• grunt> DUMP A;
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)
• grunt> X = FOREACH A GENERATE CONCAT(f2,f3);
• grunt> DUMP X;
(opensource)
(mapreduce)
(piglatin)
47. COUNT
• Computes the number of elements in a bag.
• Note: You cannot use the tuple designator (*) with COUNT; that is, COUNT(*) will not work.
48. • grunt> A = LOAD '/home/cloudera/c.txt' USING PigStorage(',') as (f1:int, f2:int, f3:int);
• grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
• grunt> B = GROUP A BY f1;
• grunt> DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
• grunt> X = FOREACH B GENERATE COUNT(A);
• grunt> DUMP X;
(1L)
(2L)
(1L)
(2L)
49. COUNT_STAR
• Computes the number of elements in a bag.
• COUNT_STAR includes NULL values in the count computation (unlike COUNT, which ignores NULL values).
• Example
• In this example COUNT_STAR is used to count the tuples in a bag.
• grunt> X = FOREACH B GENERATE COUNT_STAR(A);
50. DIFF
• Compares two fields in a tuple.
• grunt> A = LOAD '/home/cloudera/data.txt' AS (B1:bag{T1:tuple(t1:int,t2:int)},B2:bag{T2:tuple(f1:int,f2:int)});
• grunt> DUMP A;
({(8,9),(0,1)},{(8,9),(1,1)})
({(2,3),(4,5)},{(2,3),(4,5)})
({(6,7),(3,7)},{(2,2),(3,7)})
• grunt> DESCRIBE A;
A: {B1: {T1: (t1: int,t2: int)},B2: {T2: (f1: int,f2: int)}}
• grunt> X = FOREACH A GENERATE DIFF(B1,B2);
• grunt> dump X;
51. MAX
• Computes the maximum of the numeric values or chararrays in a single-column bag. MAX requires a preceding GROUP ALL statement for global maximums and a GROUP BY statement for group maximums.
• Example
– In this example the maximum GPA for all terms is computed for each student (see the GROUP operator for information about the field names in relation B).
52. • grunt> A = LOAD '/home/cloudera/student.txt' AS (name:chararray, session:chararray, gpa:float);
• grunt> DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
• grunt> B = GROUP A BY name;
• grunt> DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
• grunt> X = FOREACH B GENERATE group, MAX(A.gpa);
• grunt> DUMP X;
(John,4.0F)
(Mary,4.0F)
53. MIN
• Computes the minimum of the numeric values or chararrays in a single-column bag. MIN requires a preceding GROUP ALL statement for global minimums and a GROUP BY statement for group minimums.
• Example
– In this example the minimum GPA for all terms is computed for each student (see the GROUP operator for information about the field names in relation B).
54. • grunt> A = LOAD '/home/cloudera/student.txt' AS (name:chararray, session:chararray, gpa:float);
• grunt> DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
• grunt> B = GROUP A BY name;
• grunt> DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
• grunt> X = FOREACH B GENERATE group, MIN(A.gpa);
• grunt> DUMP X;
(John,3.7F)
(Mary,3.8F)
55. SIZE
• Computes the number of elements based on any Pig data type.
• Example
• In this example the number of characters in the first field is computed.
• grunt> A = LOAD 'data' as (f1:chararray, f2:chararray, f3:chararray);
• grunt> DUMP A;
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)
• grunt> X = FOREACH A GENERATE SIZE(f1);
• grunt> DUMP X;
(6L)
(6L)
(3L)
56. SUM
• Computes the sum of the numeric values in a single-column bag. SUM requires a preceding GROUP ALL statement for global sums and a GROUP BY statement for group sums.
• Example
• In this example the number of pets is computed.
57. • grunt> A = LOAD '/home/cloudera/data' AS (owner:chararray, pet_type:chararray, pet_num:int);
• grunt> DUMP A;
(Alice,turtle,1)
(Alice,goldfish,5)
(Alice,cat,2)
(Bob,dog,2)
(Bob,cat,2)
• grunt> B = GROUP A BY owner;
• grunt> DUMP B;
(Alice,{(Alice,turtle,1),(Alice,goldfish,5),(Alice,cat,2)})
(Bob,{(Bob,dog,2),(Bob,cat,2)})
• grunt> X = FOREACH B GENERATE group, SUM(A.pet_num);
• grunt> DUMP X;
(Alice,8L)
(Bob,4L)
59. ENDSWITH, STARTSWITH
• ENDSWITH – This function accepts two string parameters; it is used to verify whether the first string ends with the second string.
• STARTSWITH – This function accepts two string parameters. It verifies whether the first string starts with the second.
• emp.txt
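The worked example for these functions is not included in this transcript; a minimal sketch using the emp.txt data shown on the SUBSTRING() slide (the schema here is an assumption):

grunt> emp = LOAD '/home/cloudera/emp.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
grunt> starts_data = FOREACH emp GENERATE id, STARTSWITH(name,'K');
grunt> ends_data = FOREACH emp GENERATE id, ENDSWITH(name,'y');
grunt> dump starts_data;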
61. SUBSTRING()
• This function returns a substring from the given string.
• emp.txt
001,Robin,22,newyork
002,Stacy,25,Bhuwaneshwar
003,Kelly,22,Chennai
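The example slide itself is not in this transcript; a minimal sketch using the emp relation loaded above (SUBSTRING takes the string, a start index, and the index just past the last character, so this returns the first three letters of each name):

grunt> substring_data = FOREACH emp GENERATE id, SUBSTRING(name, 0, 3);
grunt> dump substring_data;
(1,Rob)
(2,Sta)
(3,Kel)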
63. EqualsIgnoreCase()
• The EqualsIgnoreCase() function is used to compare two strings, ignoring case, and verify whether they are equal. If both are equal, it returns the Boolean value true; otherwise it returns false.
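A minimal sketch, again using the emp relation loaded above:

grunt> equals_data = FOREACH emp GENERATE id, EqualsIgnoreCase(name,'ROBIN');
grunt> dump equals_data;
(1,true)
(2,false)
(3,false)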
65. UPPER(), LOWER()
• UPPER – This function is used to convert all the characters in a string to uppercase.
• LOWER – This function is used to convert all the characters in a string to lowercase.
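A minimal sketch with the same emp relation:

grunt> case_data = FOREACH emp GENERATE UPPER(name), LOWER(city);
grunt> dump case_data;
(ROBIN,newyork)
(STACY,bhuwaneshwar)
(KELLY,chennai)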
69. TRIM(), RTRIM(), LTRIM()
• The TRIM() function accepts a string and returns a copy of it with the unwanted spaces before and after it removed.
• The LTRIM() function is like TRIM(), but it removes unwanted spaces only from the left side of the given string (leading spaces).
• The RTRIM() function is like TRIM(), but it removes unwanted spaces only from the right side of the given string (trailing spaces).
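A minimal sketch with the emp relation (the calls are no-ops here unless the fields actually contain surrounding spaces):

grunt> trim_data = FOREACH emp GENERATE TRIM(name), LTRIM(name), RTRIM(name);
grunt> dump trim_data;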
72. ToDate()
• This function is used to generate a DateTime object according to the given parameters.
• date.txt
001,1989/09/26 09:00:00
002,1980/06/20 10:22:00
003,1990/12/19 03:11:44
73. • grunt> date_data = LOAD '/home/cloudera/date.txt' USING PigStorage(',') as (id:int, date:chararray);
• grunt> todate_data = FOREACH date_data GENERATE ToDate(date,'yyyy/MM/dd HH:mm:ss') as (date_time:DateTime);
• grunt> Dump todate_data;
(1989-09-26T09:00:00.000+05:30)
(1980-06-20T10:22:00.000+05:30)
(1990-12-19T03:11:44.000+05:30)
74. GetDay()
• This function accepts a date-time object as a parameter and returns the day of the month from the given date-time object.
• date.txt
001,1989/09/26 09:00:00
002,1980/06/20 10:22:00
003,1990/12/19 03:11:44
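The example slide for GetDay() is not in this transcript; a minimal sketch building on the todate_data relation from the ToDate() example above:

grunt> getday_data = FOREACH todate_data GENERATE GetDay(date_time);
grunt> dump getday_data;
(26)
(20)
(19)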
76. User Defined Functions
• Apache Pig provides extensive support for User Defined Functions (UDFs).
• Using these UDFs, we can define our own functions and use them.
• UDF support is provided in six programming languages: Java, Jython, Python, JavaScript, Ruby, and Groovy.
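Once a UDF jar has been built, it is registered and invoked from Pig Latin. A minimal sketch with hypothetical jar and class names:

grunt> REGISTER '/home/cloudera/myudfs.jar';             -- hypothetical jar built from the Maven project below
grunt> DEFINE SampleUpper com.example.pig.SampleUpper(); -- hypothetical UDF class inside that jar
grunt> upper_names = FOREACH emp GENERATE SampleUpper(name);
grunt> dump upper_names;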
77. Creating UDFs
• Open Eclipse and create a new project.
• Convert the newly created project into a Maven project.
• Copy the pom.xml. This file contains the Maven dependencies for the Apache Pig and Hadoop-core jar files.