This document summarizes Alan F. Gates' presentation on Pig, an open source tool for analyzing large datasets. Pig provides a high-level language called Pig Latin for expressing data analysis processes, which are executed as MapReduce jobs on a Hadoop cluster. Key points include:
- Pig Latin scripts are compiled into MapReduce jobs for parallel execution on Hadoop
- Pig supports common operations like load, filter, group, join, and store and integrates with user-defined functions
- Pig is used by many companies for large-scale data processing tasks like analyzing web server logs and building user behavior models
This document describes how to analyze web server log files using the Pig Latin scripting language on Apache Hadoop. It provides examples of Pig Latin scripts to analyze logs and extract insights such as the top 50 external referrers, top search terms from Bing and Google, and total requests and bytes served by hour. Pig Latin scripts allow expressing data analysis programs for large datasets in a high-level language that can be optimized and executed in parallel on Hadoop for scalability.
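To make the kind of analysis described above concrete, here is a minimal Pig Latin sketch of one such report, the per-hour request and byte totals. The log layout, field names, and timestamp handling are assumptions for illustration, not the scripts from the original document:
logs = load 'access_log' using PigStorage('\t')
       as (ip:chararray, time:chararray, request:chararray, status:int, bytes:long, referrer:chararray);
by_hour = group logs by SUBSTRING(time, 0, 13);   -- assumes an ISO-style timestamp so the first 13 characters give date and hour
hourly = foreach by_hour generate group as hour, COUNT(logs) as requests, SUM(logs.bytes) as total_bytes;
store hourly into 'requests_by_hour';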
1. Hadoop is used extensively at Twitter to handle large volumes of data from logs and other sources totaling 7TB per day. Tools like Scribe and Crane are used to input data and Elephant Bird and HBase for storage.
2. Pig is used for data analysis on these large datasets to perform tasks like counting, correlating, and researching trends in users and tweets.
3. The results of these analyses are used to power various internal and external Twitter products and keep the business agile through ad-hoc analyses.
This document presents a framework for stock analysis using Hadoop. It outlines using Hive and MapReduce to analyze stock data from the NYSE to obtain adjusted closing share prices after dividend distributions. The technical architecture involves using MapReduce programs and Hive tables. Code examples in Hive and MapReduce are provided to load and clean the data, perform inner joins, and calculate adjusted closing prices on dividend dates. The results yield the adjusted closing prices; the business implications include examining historical trends to encourage investment and to demonstrate efficient company performance that meets shareholder expectations.
Pig is a platform for analyzing large datasets that uses a high-level language to express data analysis programs. It compiles programs into MapReduce jobs that can run in parallel on a Hadoop cluster. Pig provides built-in functions for common tasks and allows users to define their own custom functions (UDFs). Programs can be run locally or on a Hadoop cluster by placing commands in a script or Grunt shell.
This document provides information about a Pig workshop being conducted by Sudar Muthu. It introduces Pig, describing it as a platform for analyzing large data sets. It outlines some key Pig components like the Pig Shell, Pig Latin language, libraries, and user defined functions. It also discusses why Pig is useful, highlighting aspects like its data flow model and ability to increase programmer productivity. Finally, it previews topics that will be covered in the workshop, such as loading and storing data, Pig Latin operators, and writing user defined functions.
The document provides an overview of various Apache Pig features including:
- The Grunt shell which allows interactive execution of Pig Latin scripts and access to HDFS.
- Advanced relational operators like SPLIT, ASSERT, CUBE, SAMPLE, and RANK for transforming data.
- Built-in functions and user defined functions (UDFs) for data processing. Macros can also be defined.
- Running Pig in local or MapReduce mode and accessing HDFS from within Pig scripts.
This document provides an overview of the Pig Latin data flow language for Hadoop. It discusses why Pig is useful for increasing productivity and insulating users from complexity when working with MapReduce. The document provides examples of simple Pig Latin scripts for common tasks like filtering, joining, grouping and aggregation. It also covers performance considerations, user defined functions, common pitfalls, and recommendations.
This document provides an overview and introduction to BigData using Hadoop and Pig. It begins with introducing the speaker and their background working with large datasets. It then outlines what will be covered, including an introduction to BigData, Hadoop, Pig, HBase and Hive. Definitions and examples are provided for each. The remainder of the document demonstrates Hadoop and Pig concepts and commands through code examples and explanations.
The document provides information about Hive and Pig, two frameworks for analyzing large datasets using Hadoop. It compares Hive and Pig, noting that Hive uses a SQL-like language called HiveQL to manipulate data, while Pig uses Pig Latin scripts and operates on data flows. The document also includes code examples demonstrating how to use basic operations in Hive and Pig like loading data, performing word counts, joins, and outer joins on sample datasets.
Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It presents a SQL-like interface for querying data stored in various databases and file systems that integrate with Hadoop. The document provides links to Hive documentation, tutorials, presentations and other resources for learning about and using Hive. It also includes a table describing common Hive CLI commands and their usage.
Hadoop and Pig are tools for analyzing large datasets. Hadoop uses MapReduce and HDFS for distributed processing and storage. Pig provides a high-level language for expressing data analysis jobs that are compiled into MapReduce programs. Common tasks like joins, filters, and grouping are built into Pig for easier programming compared to lower-level MapReduce.
1. The data is loaded from a file into relation 'divs' with specified data types
2. A filter is applied to 'divs' to only keep records where the symbol field matches the regular expression 'CM.*'
3. The filtered relation is stored in 'startswithcm'
The script loads data from a file, applies a regular expression filter to select records where the symbol starts with "CM", and stores the filtered relation. It performs a basic extract, filter, and store workflow in Pig Latin.
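A minimal Pig Latin rendering of that workflow might look like the following; only the 'CM.*' pattern and the startswithcm name come from the description above, while the input file and field list are assumptions:
divs = load 'NYSE_dividends'
       as (exchange:chararray, symbol:chararray, date:chararray, dividend:float);
startswithcm = filter divs by symbol matches 'CM.*';
store startswithcm into 'startswithcm';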
This presentation demonstrates Apache Pig: introduction, description, installation, Pig Latin commands, usage, examples, and usefulness.
Tushar B. Kute
Researcher,
http://tusharkute.com
Apache Pig performance optimizations talk at ApacheCon 2010, by Thejas Nair
Pig provides a high-level language called Pig Latin for analyzing large datasets. It optimizes Pig Latin scripts by restructuring the logical query plan through techniques like predicate pushdown and operator rewriting, and by generating efficient physical execution plans that leverage features like combiners, different join algorithms, and memory management. Future work aims to improve memory usage and allow joins and groups within a single MapReduce job when keys are the same.
The document discusses Hive, an open source data warehousing system built on Hadoop that allows users to query large datasets using SQL. It describes Hive's data model, architecture, query language features like joins and aggregations, optimizations, and provides examples of how queries are executed using MapReduce. The document also covers Hive's metastore, external tables, data types, and extensibility features.
Apache Pig is a platform for analyzing large datasets that consists of a high-level data flow language called Pig Latin and an infrastructure for evaluating Pig Latin programs. Pig Latin scripts are compiled into sequences of MapReduce jobs that can run on Hadoop for large scale parallel processing. Pig aims to provide a simpler programming model than raw MapReduce while still allowing for optimization and parallelization of queries. Pig programs can be run interactively using the Grunt shell or by specifying a Pig Latin script to execute.
Hw09 Hadoop Development At Facebook Hive And Hdfs, by Cloudera, Inc.
This document discusses Hadoop and Hive development at Facebook, including how they generate large amounts of user data daily, how they store the data in Hadoop clusters, and how they use Hive as a data warehouse to efficiently run SQL queries on the Hadoop data using a SQL-like language. It also outlines some of Hive's architecture and features like partitioning, buckets, and UDF/UDAF support, as well as its performance improvements over time and future planned work.
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
Learning Objectives - In this module, you will learn what Pig is, the types of use cases where Pig fits, how Pig is tightly coupled with MapReduce, and Pig Latin scripting.
The document discusses Pig Latin scripts and how to execute them. It provides examples of multi-line and single-line comments in Pig Latin scripts. It also describes how to execute Pig Latin scripts locally or using MapReduce and how to execute scripts that reside in HDFS. Finally, it summarizes key differences between Pig Latin and SQL and describes common relational operators used in Pig Latin.
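For reference, a small sketch of the two comment styles being described, with placeholder relation and field names:
-- a single-line comment
/* a multi-line comment:
   load a relation, filter it, and print it */
a = load 'student' as (name:chararray, gpa:float);
b = filter a by gpa > 3.5;
dump b;
Scripts like this are typically launched with pig -x local script.pig for local mode or pig script.pig for MapReduce mode; Pig also accepts an HDFS URL as the script path, which is how scripts that reside in HDFS are run.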
Hive User Meeting March 2010 - Hive Team, by Zheng Shao
The document summarizes new features and API updates in the Hive data warehouse software. It discusses enhancements to JDBC/ODBC connectivity, the introduction of CREATE TABLE AS SELECT (CTAS) functionality, improvements to join strategies including map joins and bucketed map joins, and work on views, HBase integration, user-defined functions, serialization/deserialization (SerDe), and object inspectors. It also provides guidance on developing new SerDes for custom data formats and serialization needs.
Apache Pig is a platform for analyzing large datasets that consists of a high-level language called Pig Latin for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig Latin scripts are compiled into a series of MapReduce jobs that are executed to process the data. Pig provides tools for loading, storing, filtering, grouping, joining, and other operations on large datasets in parallel across a cluster using Hadoop. It aims to abstract away the complexity of MapReduce to make the data analysis process easier for analysts.
Apache Pig is a platform for analyzing large datasets that runs on Hadoop. It provides a high-level language called Pig Latin that allows users to write data analysis programs without having to write complex MapReduce code in Java. Pig Latin scripts are compiled into MapReduce jobs. Pig offers features like built-in operators for joins, filters, and ordering and can handle both structured and unstructured data.
Paul Tarjan ( http://github.com/ptarjan ) presented this to the Hadoop User Group at the Yahoo! Sunnyvale campus on 11/18/09. Paul describes his solution for building a Hadoop Record Reader in Python.
If you’re already a SQL user then working with Hadoop may be a little easier than you think, thanks to Apache Hive. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).
This cheat sheet covers:
-- Query
-- Metadata
-- SQL Compatibility
-- Command Line
-- Hive Shell
1) Pig 0.8 focuses on usability, integration, performance, and backwards compatibility with 0.7.
2) It allows UDFs to be written in scripting languages like Jython and provides better statistics and integration with Hadoop job history.
3) Other new features include invoking static Java functions as UDFs, improved HBase integration, and casting relations to scalars.
Pig is a scripting language used for data analysis on Apache Hadoop. Pig scripts describe data flows and can perform all necessary data manipulations on Hadoop. Pig also allows integration with other languages through User Defined Functions. The document then provides steps to upload sample baseball data files to Hue and write a Pig Latin script to load, filter, group, and join the data to find the maximum runs scored by players each year.
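A hedged sketch of that flow in Pig Latin; the file name, delimiter, and column layout are assumptions rather than the tutorial's actual data:
batting = load 'Batting.csv' using PigStorage(',')
          as (player_id:chararray, year:int, runs:int);
runs_data = filter batting by runs is not null;
by_year = group runs_data by year;
max_runs = foreach by_year generate group as year, MAX(runs_data.runs) as max_runs;
-- join back to recover which player scored the maximum runs in each year
winners = join max_runs by (year, max_runs), runs_data by (year, runs);
result = foreach winners generate max_runs::year, runs_data::player_id, max_runs::max_runs;
store result into 'max_runs_by_year';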
Pig is a platform for analyzing large datasets that sits on top of Hadoop. It provides a simple language called Pig Latin for expressing data analysis processes. Pig Latin scripts are compiled into a series of MapReduce jobs that process and analyze data in parallel across a Hadoop cluster. Pig aims to be easier to use than raw MapReduce programs by providing high-level operations like JOIN, FILTER, and GROUP, and by allowing analysis to be expressed without writing Java code. Common use cases for Pig include log and web data analysis, ETL processes, and quick prototyping of algorithms for large-scale data.
Scaling python webapps from 0 to 50 million users - A top-down approach, by Jinal Jhaveri
This document provides an overview of scaling a Python web application from 0 to 50 million users. It discusses key bottlenecks and solutions at different levels including the load balancer, web server, web application and browser. It emphasizes the importance of profiling, measuring and improving performance iteratively. Specific techniques mentioned include using Memcached to avoid database trips, asynchronous programming, compression, caching, and a performance strategy of measure, profile and improve.
This document introduces Pig, an open source platform for analyzing large datasets that sits on top of Hadoop. It provides an example of using Pig Latin to find the top 5 most visited websites by users aged 18-25 from user and website data. Key points covered include who uses Pig, how it works, performance advantages over MapReduce, and upcoming new features. The document encourages learning more about Pig through online documentation and tutorials.
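The example described is essentially the canonical Pig walkthrough; a sketch under assumed file and field names:
users = load 'users' as (name:chararray, age:int);
pages = load 'pages' as (user:chararray, url:chararray);
young = filter users by age >= 18 and age <= 25;
joined = join young by name, pages by user;
by_url = group joined by url;
counted = foreach by_url generate group as url, COUNT(joined) as visits;
ranked = order counted by visits desc;
top5 = limit ranked 5;
store top5 into 'top5_sites';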
This document discusses building a feature store using Apache Spark and dataframes. It provides examples of major feature store concepts like feature groups, training/test datasets, and joins. Feature store implementations from companies like Uber, Airbnb and Netflix are also mentioned. The document outlines the architecture of storing both online and offline feature groups and describes the evolution of the feature store API to better support concepts like feature versioning, multiple stores, complex joins and time travel. Use cases demonstrated include fraud detection in banking and modeling crop yields using joined weather and agricultural data.
Building a Feature Store around Dataframes and Apache Spark, by Databricks
A Feature Store enables machine learning (ML) features to be registered, discovered, and used as part of ML pipelines, thus making it easier to transform and validate the training data that is fed into machine learning systems. Feature stores can also enable consistent engineering of features between training and inference, but to do so, they need a common data processing platform.
This document provides an overview of Apache Pig and Pig Latin for querying large datasets. It discusses why Pig was created due to limitations in SQL for big data, how Pig scripts are written in Pig Latin using a simple syntax, and how PigLatin scripts are compiled into MapReduce jobs and executed on Hadoop clusters. Advanced topics covered include user-defined functions in PigLatin for custom data processing and sharing functions through Piggy Bank.
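A sketch of how a Piggy Bank function is typically registered and used; the jar path is a placeholder, and although UPPER has long shipped with Piggy Bank, the exact class path should be checked against your Pig version:
register '/path/to/piggybank.jar';
define UPPER org.apache.pig.piggybank.evaluation.string.UPPER();
a = load 'names' as (name:chararray);
b = foreach a generate UPPER(name);
dump b;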
Pig - A Data Flow Language and Execution Environment for Exploring Very Large..., by DrPDShebaKeziaMalarc
This document provides an overview of Pig, a data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters. Pig Latin is the language used to express data flows in Pig programs. It allows for both declarative querying and low-level procedural programming. Pig Latin programs are compiled into a logical plan and then optimized and executed as a series of MapReduce jobs. Compared to SQL in RDBMS, Pig Latin operates on nested data structures rather than flat tables and loads data directly from files rather than requiring an import process.
Pig is a data flow language and execution environment for exploring very large datasets. It runs on Hadoop and MapReduce clusters. Pig Latin is the language used to express data flows in Pig programs. It allows for both declarative querying using high-level constructs as well as procedural programming using low-level operations. Pig Latin programs are compiled into a logical plan and then optimized and executed as MapReduce jobs on a Hadoop cluster. Pig supports complex nested data structures and user-defined functions.
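One concrete way the nested data model shows up is that grouping produces bags, which can be worked on inside a nested foreach block; a small illustrative sketch with assumed relation and field names:
clicks = load 'clicks' as (user:chararray, url:chararray);
by_user = group clicks by user;                    -- each record now carries a bag of that user's rows
uniques = foreach by_user {
    urls = distinct clicks.url;                    -- operate on the nested bag
    generate group as user, COUNT(urls) as unique_urls;
};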
Pig is a platform for analyzing large datasets that sits between low-level MapReduce programming and high-level SQL queries. It provides a language called Pig Latin that allows users to specify data analysis programs without dealing with low-level details. Pig Latin scripts are compiled into sequences of MapReduce jobs for execution. HCatalog allows data to be shared between Pig, Hive, and other tools by reading metadata about schemas, locations, and formats.
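When HCatalog is available, Pig can refer to a Hive-managed table by name instead of an HDFS path; a hedged sketch, noting that the loader's package name has moved between releases and the field names here are assumptions:
-- launch with: pig -useHCatalog  so the HCatalog jars are on the classpath
raw = load 'web_logs' using org.apache.hive.hcatalog.pig.HCatLoader();    -- table name, not a file path
cleaned = filter raw by user is not null;
store cleaned into 'web_logs_clean' using org.apache.hive.hcatalog.pig.HCatStorer();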
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ..., by Julian Hyde
A talk given by Julian Hyde and Tomer Shiran at Hadoop Summit, Dublin.
Data scientists and analysts want the best API, DSL or query language possible, not to be limited by what the processing engine can support. Polyalgebra is an extension to relational algebra that separates the user language from the engine, so you can choose the best language and engine for the job. It also allows the system to optimize queries and cache results. We demonstrate how Ibis uses Polyalgebra to execute the same Python-based machine learning queries on Impala, Drill and Spark. And we show how to build Polyalgebra expressions in Calcite and how to define optimization rules and storage handlers.
The document discusses polyalgebra, an extended form of relational algebra that can handle complex data types like nested records and streaming data. It allows various data processing engines and SQL query engines to operate over different data sources using a single optimization framework. The document outlines the ecosystem of data stores, engines, and frameworks that can be used with polyalgebra and Calcite's rule-based query planning system. It provides examples of how relational algebra expressions capture the logic of SQL queries and how rules are used to optimize query plans.
1. The document discusses using Hadoop and Hive at Zing to build a log collecting, analyzing, and reporting system.
2. Scribe is used for fast log collection and storing data in Hadoop/Hive. Hive provides SQL-like queries to analyze large datasets.
3. The system transforms logs into Hive tables, runs analysis jobs in Hive, then exports data to MySQL for web reporting. This provides a scalable, high performance solution compared to the initial RDBMS-only system.
Description of some of the elements that go into creating a PostgreSQL-as-a-Service for organizations with many teams and a diverse ecosystem of applications and teams.
ClojureScript - Making Front-End development Fun again - John Stevenson - Cod..., by Codemotion
Front-end development has an amazing assortment of libraries and tools, yet it can seem very complex and doesn't seem much fun. So we'll live code a ClojureScript application (with a bit of help from Git) and show how development doesn't have to be complex or slow. Through live evaluation, we can build a reactive, functional application. Why not take a look at a well designed language that uses modern functional and reactive concepts for building front-end apps? You are going to have to transpile anyway, so why not use a language, libraries, and tooling that are bursting with fun to use.
The document discusses using Parse Cloud Code to build web applications, including basic operations like create, read, update, delete, how Parse and RESTful APIs work, and how to use Cloud Code to call external APIs, run background jobs, and include other JavaScript modules.
Have you ever left out a comma in your payload calling an API on the command line? Have you cURL’d a REST API only to realize you forgot that crucial query parameter? If only the command line could take advantage of some structured description of the API model…
APIs can be challenging to understand and consume from the command line, a utility most developers use daily. In this session we will look at an approach to generating intuitive command line utilities for gRPC APIs.
Serverless ML Workshop with Hopsworks at PyData Seattle, by Jim Dowling
1. The document discusses building a minimal viable prediction service (MVP) to predict air quality using only Python and free serverless services in 90 minutes.
2. It describes creating feature, training, and inference pipelines to build an air quality prediction service using Hopsworks, Modal, and Streamlit/Gradio.
3. The pipelines would extract features from weather and air quality data, train a model, and deploy an inference pipeline to make predictions on new data.
Similar to Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gates (20)
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media, by Yahoo Developer Network
This document discusses developing mobile apps for performance. It emphasizes that user-perceived latency, stability, and battery life matter most to users. A key performance indicator is cold app launch time, which should be under 2 seconds to keep users happy. Measuring app performance is challenging as it needs to account for different devices, networks, and conditions. The document recommends reducing network calls to load the home screen faster by fetching content in the user's viewport with a single endpoint and network call.
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras..., by Yahoo Developer Network
Athenz is an open-source solution that provides access control for dynamic infrastructures. It offers service authentication through secure identity in the form of x.509 certificates for every service. It also provides fine-grained role-based access control (RBAC). Athenz aims to solve problems around identity and policy that are common in large infrastructures. It acts as a single source of truth for access control across multiple cloud computing environments like Kubernetes and OpenStack.
Presented at the SPIFFE Meetup in Tokyo.
Athenz (www.athenz.io) is an open source platform for X.509 certificate-based service authentication and fine-grained access control in dynamic infrastructures.
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat..., by Yahoo Developer Network
Athenz (www.athenz.io) is an open source platform for X.509 certificate-based service authentication and fine-grained access control in dynamic infrastructures that provides options to run multi-environments with a single access control model.
Jithin Emmanuel, Sr. Software Development Manager, Developer Platform Services, provides an overview of Screwdriver (http://www.screwdriver.cd), and shares how it’s used at scale for CI/CD at Oath. Jithin leads the product development and operations of Screwdriver, which is a flagship CI/CD product used at scale in Oath.
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath, by Yahoo Developer Network
Offline and stream processing of big data sets can be done with tools such as Hadoop, Spark, and Storm, but what if you need to process big data at the time a user is making a request? Vespa (http://www.vespa.ai) allows you to search, organize and evaluate machine-learned models from e.g TensorFlow over large, evolving data sets with latencies in the tens of milliseconds. Vespa is behind the recommendation, ad targeting, and search at Yahoo where it handles billions of daily queries over billions of documents.
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan..., by Yahoo Developer Network
This document discusses containerization on Apache Hadoop YARN. It introduces YARN container runtimes, which allow containers like Docker to run on YARN. This enables easier onboarding of new applications. The YARN services framework provides tools for long-running services on YARN through components, configurations, and lifecycle management. YARN service discovery allows services to find each other through a registry exposed via DNS. Recent improvements in Hadoop 3.1 include improved Docker support, auto-spawning admin services, and usability enhancements. Future work may include additional runtimes, persistent storage, and inter-service dependencies.
Orion is a petabyte scale AI platform developed by the Big Data and Insights (BDAI) team at Oath to generate actionable insights from large datasets through scalable machine learning. The platform can process over 60 billion records per day from a variety of data sources and uses techniques like anomaly detection and predictive algorithms to provide insights that improve efficiencies, reduce costs, and enhance customer experiences. Orion offers a centralized architecture and suite of APIs to build custom solutions for applications in advertising, marketing, IoT, and other markets at an enterprise scale.
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth..., by Yahoo Developer Network
Offline and stream processing of big data sets can be done with tools such as Hadoop, Spark, and Storm, but what if you need to process big data at the time a user is making a request?
This presentation introduces Vespa (http://vespa.ai) – the open source big data serving engine.
Vespa allows you to search, organize, and evaluate machine-learned models from e.g TensorFlow over large, evolving data sets with latencies in the tens of milliseconds. Vespa is behind the recommendation, ad targeting, and search at Yahoo where it handles billions of daily queries over billions of documents and was recently open sourced at http://vespa.ai.
In recent times, the YARN Capacity Scheduler has improved a lot in terms of critical features and refactoring. Here is a quick look at some of the recent changes in the scheduler:
Global Scheduling Support
General placement support
Better preemption model to handle resource anomalies across and within queues.
Absolute resource configuration support
Priority support between Queues and Applications
In this talk, we will deep dive into each of these new features to give a better picture of their usage and performance comparison. We will also provide some more brief overview about the ongoing efforts and how they can help to solve some of the core issues we face today.
Speakers:
Sunil Govind (Hortonworks), Jian He (Hortonworks)
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies, by Yahoo Developer Network
In recent years, Yahoo has brought the big data ecosystem and machine learning together to discover mathematical models for search ranking, online advertising, content recommendation, and mobile applications. We use distributed computing clusters with CPUs and GPUs to train these models from 100’s of petabytes of data.
A collection of distributed algorithms have been developed to achieve 10-1000x the scale and speed of alternative solutions. Our algorithms construct regression/classification models and semantic vectors within hours, even for billions of training examples and parameters. We have made our distributed deep learning solutions, CaffeOnSpark and TensorFlowOnSpark, available as open source.
In this talk, we highlight Yahoo use cases where big data and machine learning technologies are best exemplified. We explain algorithm/system challenges to scale ML algorithms for massive datasets. We provide a technical overview of CaffeOnSpark and TensorFlowOnSpark to jumpstart your journey of large-scale machine learning.
Speakers:
Andy Feng is a VP of Architecture at Yahoo, leading the architecture and design of big data and machine learning initiatives. He has architected large-scale systems for personalization, ad serving, NoSQL, and cloud infrastructure. Prior to Yahoo, he was a Chief Architect at Netscape/AOL, and Principal Scientist at Xerox. He received a Ph.D. degree in computer science from Osaka University, Japan.
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro..., by Yahoo Developer Network
This document discusses the challenges of operationalizing big data applications and how full stack performance intelligence can help DataOps teams address issues. It describes how intelligence can provide automated diagnosis and remediation to solve problems, automated detection and prevention to be proactive, and automated what-if analysis and planning to prepare for future use. Real-life examples show how intelligence can help with proactively detecting SLA violations, diagnosing Hive/Spark application failures, and planning a migration of applications to the cloud.
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex, by Yahoo Developer Network
Apache Apex (http://apex.apache.org/) is a stream processing platform that helps organizations to build processing pipelines with fault tolerance and strong processing guarantees. It was built to support low processing latency, high throughput, scalability, interoperability, high availability and security. The platform comes with Malhar library - an extensive collection of processing operators and a wide range of input and output connectors for out-of-the-box integration with an existing infrastructure. In the talk I am going to describe how connectors together with the distributed checkpointing (a mechanism used by the Apex to support fault tolerance and high availability) provide exactly-once end-to-end processing guarantees.
Speakers:
Vlad Rozov is Apache Apex PMC member and back-end engineer at DataTorrent where he focuses on the buffer server, Apex platform network layer, benchmarks and optimizing the core components for low latency and high throughput. Prior to DataTorrent Vlad worked on distributed BI platform at Huawei and on multi-dimensional database (OLAP) at Hyperion Solutions and Oracle.
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics, by Yahoo Developer Network
1. Sketch algorithms provide approximate query results with sub-linear space and processing time, enabling analysis of big data that would otherwise require prohibitive resources.
2. Case studies show sketches reduce storage by over 90% and processing time by over 95% compared to exact algorithms, enabling real-time querying and rollups across multiple dimensions that were previously infeasible.
3. The DataSketches library provides open-source implementations of popular sketch algorithms like Theta, HLL, and quantiles sketches, with code samples and adapters for systems like Hive, Pig, and Druid.
2. Who Am I? Pig committer and PMC member; an architect on the Yahoo! grid team. Photo credit: Steven Guarnaccia, The Three Little Pigs.
3. Motivation By Example: You have web server logs of purchases on your site. You want to find the 10 users who bought the most and the cities they live in. You also want to know what percentage of purchases they account for in those cities. (Data flow diagram: load logs; find top 10 users; sum purchases by city; join by city; store top 10 users; calculate percentage; store results.)
4. In Pig Latin:
raw = load 'logs' as (name, city, purchase);
-- Find top 10 users
usrgrp = group raw by (name, city);
byusr = foreach usrgrp generate group as k1, SUM(raw.purchase) as utotal;
srtusr = order byusr by utotal desc;
topusrs = limit srtusr 10;
store topusrs into 'top_users';
-- Count purchases per city
citygrp = group raw by city;
bycity = foreach citygrp generate group as k2, SUM(raw.purchase) as ctotal;
-- Join top users back to city
jnd = join topusrs by k1.city, bycity by k2;
pct = foreach jnd generate k1.name, k1.city, utotal/ctotal;
store pct into 'top_users_pct_of_city';
7. Where Do Pigs Live? (Diagram: Data Factory with Pig pipelines, iterative processing, and research; Data Warehouse with BI tools and analysis; Data Collection.)
8. Pig Highlights
- Language designed to enable efficient description of data flow
- Standard relational operators built in
- User defined functions (UDFs) can be written for column transformation (TOUPPER) or aggregation (SUM)
- UDFs can be written to take advantage of the combiner
- Four join implementations built in: hash, fragment-replicate, merge, skewed
- Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
- Order by provides total ordering across reducers in a balanced way
- Writing load and store functions is easy once an InputFormat and OutputFormat exist
- Piggybank, a collection of user contributed UDFs
9. Multi-store script:
A = load 'users' as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by age, gender;
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';
(Diagram: load users; filter nulls; then split into group by age, gender / apply UDFs / store into 'bydemo' and group by state / apply UDFs / store into 'bystate'.)
10. Multi-Store Map-Reduce Plan (diagram: the map phase contains filter, split, and two local rearrange operators; the reduce phase contains multiplex, then package and foreach for each pipeline).
15. Who Uses Pig for What?
- 70% of production grid jobs at Yahoo (10Ks per day)
- Also used by Twitter, LinkedIn, Ebay, AOL, …
- Used to: process web logs, build user behavior models, process images, build maps of the web, do research on raw data sets
18. PigServer: a Java class with a JDBC-like interface. Pig resides on the user machine; the job executes on the Hadoop cluster. No need to install anything extra on your Hadoop cluster.
25. New in 0.8
- UDFs can be written in Jython
- Improved and expanded statistics
- Performance improvements
- Automatic merging of small files
- Compression of intermediate results
- PigUnit for unit testing your Pig Latin scripts
- Access to static Java functions as UDFs
- Improved HBase integration
- Custom partitioners: B = group A by $0 partition by YourPartitioner parallel 2;
- Greatly expanded string and math built-in UDFs
26. What's Next? Preview of Pig 0.9
- Integrate Pig with scripting languages for control flow
- Add macros to Pig Latin
- Revive ILLUSTRATE
- Fix runtime type errors
- Rewrite parser to give more useful error messages
- Programming Pig from O'Reilly Press
27. Learn More
- Online documentation: http://pig.apache.org/
- Hadoop, The Definitive Guide (2nd edition) has an up-to-date chapter on Pig; search at your favorite bookstore
- Join the mailing lists: user@pig.apache.org for user questions, dev@pig.apache.org for developer issues
- Follow me on Twitter, @alanfgates
28. UDFs in Scripting Languages
- Evaluation functions can now be written in scripting languages that compile down to the JVM
- Reference implementation provided in Jython
- JRuby and others could be added with minimal code
- JavaScript implementation in progress
- Jython sold separately
29. Example Python UDF
test.py:
@outputSchema("sqr:long")
def square(num):
    return num * num
test.pig:
register 'test.py' using jython as myfuncs;
A = load 'input' as (i:int);
B = foreach A generate myfuncs.square(i);
dump B;
30. Better statistics
- Statistics printed out at the end of a job run
- Pig information stored in Hadoop's job history files so you can mine the information and analyze your Pig usage
- Loader for reading job history files included in Piggybank
- New PigRunner interface that allows users to invoke Pig and get back a statistics object that contains stats information
- Can also pass a listener to track Pig jobs as they run
- Done for Oozie so it can show users Pig statistics
31. Sample stats info
Job Stats (time in seconds):
JobId   Maps  Reduces  MxMT  MnMT  AMT  MxRT  MnRT  ART  Alias
job_0   2     1        15    3     9    27    27    27   a,b,c,d,e
job_1   1     1        3     3     3    12    12    12   g,h
job_2   1     1        3     3     3    12    12    12   i
job_3   1     1        3     3     3    12    12    12   i
Input(s):
Successfully read 10000 records from: "studenttab10k"
Successfully read 10000 records from: "votertab10k"
Output(s):
Successfully stored 6 records (150 bytes) in: "outfile"
Counters:
Total records written: 6
Total bytes written: 150
32. Invoke Static Java Functions as UDFs
Often the UDF you need already exists as a Java function, e.g. Java's URLDecoder.decode() for decoding URLs:
define UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
A = load 'encoded.txt' as (e:chararray);
B = foreach A generate UrlDecode(e, 'UTF-8');
Currently only works with simple types and static functions.
33. Improved HBase Integration
- Can now read records as bytes instead of auto-converting to strings
- Filters can be pushed down
- Can store data in HBase as well as load from it
- Works with HBase 0.20 but not 0.89 or 0.90; the patch in PIG-1680 addresses this but has not been committed yet
Editor's Notes
A very common use case we see at Yahoo is users want to read one data set and group it several different ways. Since scan time often dominates for these large data sets, sharing one scan across several group instances can result in nearly linear speed up of queries.
In this case multiple pipelines are needed in the Map and Reduce phases. Due to our pull-based execution model, we have the split and multiplex operators embed the pipelines within themselves. Records are tagged with the pipeline number in the map stage. Grouping is done by Hadoop using a union of the keys. The multiplex operator on the reducer places incoming records in the correct pipeline.
As your website grows, the number of unique users grows beyond what you can keep in memory. A given map only gets input from a given input source. It can therefore annotate tuples from that source with information on which source they came from. The join key is then used to partition the data, but the join key plus the input source id is used to sort it. This allows Pig to buffer one side of the join keys in memory and then use that as a probe table as keys from the other input stream by.
Running example: You start a website. You want to know how users are using your website, so you collect a couple of streams of information from your logs: page views and users. When you start you have a fair number of page views, but not many users. In this algorithm the smaller table is copied to every map in its entirety (it doesn't yet use the Distributed Cache, though it should). The larger file is partitioned as per normal MapReduce.
As your website grows even more, some pages become significantly more popular than others. This means that some pages are visited by almost every user, while others are visited only by a few users. First, a sampling pass is done to determine which keys are large enough to need special attention. These are keys that have enough values that we estimate we cannot hold them all in memory. It's about holding the values in memory, not the key. Then at partitioning time, those keys are handled specially; all other keys are treated as in the regular join. The selected keys from input1 are split across multiple reducers. For input2, they are replicated to each of the reducers that got a split. In this way we guarantee that every instance of key k from input1 comes into contact with every instance of k from input2.
Now let's say that for some reason you start keeping both your page view data and user data sorted by user. Note that one way to do this is to make sure that pages and users are partitioned the same way. But this leads to a big problem: in order to make sure you can join all your data sets, you end up using the same hash function to join them all. But rarely does one bucketing scheme make sense for all your data. Whatever is big enough for one data set will be too small for others, and vice versa. So Pig's implementation doesn't depend on how the data is split. Pig does this by sampling one of the inputs and then building an index from that sample that indicates the key for the first record in every split. The other input is used as the standard input file for Hadoop and is split to the maps as per normal. When the map begins processing this file and encounters the first key, it uses the index to determine where it should open the second, sampled file. It then opens the file at the appropriate point, seeks forward until it finds the key it is looking for, and then begins doing a join on the two data sources.
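Pig exposes these specialized joins through a using clause; a sketch of how the variants described in these notes are requested, with illustrative relation and key names:
-- fragment-replicate join: the second input is small enough to copy to every map
jnd1 = join pages by user, users by name using 'replicated';
-- skewed join: for keys whose value lists are too large to hold in one reducer's memory
jnd2 = join pages by user, users by name using 'skewed';
-- merge join: both inputs are already sorted on the join key
jnd3 = join pages by user, users by name using 'merge';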
Can’t yet inline the Python functions in Pig Latin script. In 0.9 we’ll add the ability to put them in the same file.