Apache DataFu is a collection of libraries for working with large-scale data in Hadoop. This presentation provides an introduction to Apache DataFu, covering both the datafu-pig and datafu-hourglass libraries, and includes a number of examples of using DataFu to make Pig development easier.
Youtube big data analysis using hadoop, pig, hive (pankaj chhipa)
This document describes analyzing YouTube data using Hadoop and Hive. It discusses extracting video data from YouTube using an API, loading the data into HDFS, then processing the data using Hive queries. Example queries are provided to find the top 5 categories of videos uploaded, the top 10 highest rated videos, and the number of videos uploaded by users under 18 by category. The goal is to analyze large-scale YouTube data for insights using Hadoop's distributed processing capabilities.
This document discusses Apache Pig and its role in data science. It begins with an introduction to Pig, describing it as a high-level scripting language for operating on large datasets in Hadoop. It transforms data operations into MapReduce/Tez jobs and optimizes the number of jobs required. The document then covers using Pig for understanding data through statistics and sampling, machine learning by sampling large datasets and applying models with UDFs, and natural language processing on large unstructured data.
The speaker discusses how they used Terraform to improve their workflow for data science projects. As a data scientist, they spent most of their time dealing with infrastructure issues rather than the data science work. Terraform's "infrastructure as code" approach allowed them to define and provision resources like servers and databases in a declarative way. This improved reproducibility and made it easier to set up and destroy resources for experiments. Modules also helped abstract complexity and allowed resources to be composed together. The speaker argues this approach can benefit both data scientists and devops teams by making infrastructure part of the reproducible workflow.
Good practices for PrestaShop code security and optimization (PrestaShop)
The document discusses various optimizations that can be made to improve the performance and security of a PrestaShop installation. It covers optimizations to server infrastructure, database queries, PHP code, and front-end performance. Key recommendations include using caching, minimizing database queries and regular expressions, compressing responses, and securing against common attacks like SQL injection. Measurements are suggested to identify bottlenecks before optimizing.
Telemetry doesn't have to be scary; Ben Ford (Puppet)
This document discusses Puppet telemetry and metrics collection. It introduces Dropsonde, an open source tool for collecting anonymous usage data from Puppet servers. Dropsonde plugins define metrics that are collected and sent to Google BigQuery for analysis. The data is aggregated and made public to help understand Puppet module usage and ecosystem trends, while keeping individual server data private. Users are encouraged to contribute plugins and use the public data for their own analysis and tools.
This document summarizes Joseph Adler's talk on fast lookups in R. It discusses:
1. Looking up values in vectors can take O(n) time, while lookups in environments take O(1) time on average since environments use hash tables.
2. It measures the speed of lookups in vectors versus environments of varying sizes, finding lookups in environments are faster, especially for larger tables.
3. It offers suggestions for optimizing lookup speed, such as using positions instead of names and preferring environments over vectors for large tables that require frequent lookups (the sketch below reproduces the same principle in Python).
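The same vector-versus-hash-table trade-off is easy to reproduce outside R. A minimal Python sketch (not from the talk) comparing a linear scan against a hash lookup:

```python
import timeit

n = 100_000
keys = [f"key{i}" for i in range(n)]
as_list = list(keys)                          # name lookup = linear scan, like a named vector
as_dict = {k: i for i, k in enumerate(keys)}  # hash table, like an R environment

target = f"key{n - 1}"  # worst case for the scan
print(timeit.timeit(lambda: as_list.index(target), number=100))  # O(n) per lookup
print(timeit.timeit(lambda: as_dict[target], number=100))        # O(1) on average
```

At this size the dict lookup is several orders of magnitude faster, mirroring the talk's vector-versus-environment measurements.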
Apache Hadoop is quickly becoming the technology of choice for organizations investing in big data, powering their next-generation data architecture. With Hadoop serving as both a scalable data platform and computational engine, data science is re-emerging as a centerpiece of enterprise innovation, with applied data solutions such as online product recommendation, automated fraud detection and customer sentiment analysis. In this talk Ofer will provide an overview of data science and how to take advantage of Hadoop for large-scale data science projects:
* What is data science?
* How can techniques like classification, regression, clustering and outlier detection help your organization?
* What questions do you ask and which problems do you go after?
* How do you instrument and prepare your organization for applied data science with Hadoop?
* Who do you hire to solve these problems?
You will learn how to plan, design and implement a data science project with Hadoop.
This is part of an introductory course to Big Data Tools for Artificial Intelligence. These slides introduce students to the use of Apache Pig as an ETL tool over Hadoop.
Pig programming is more fun: New features in Pig (daijy)
In the last year, we added lots of new language features to Pig, and Pig programming is much easier than before. With Pig macros, we can write functions for Pig and modularize Pig programs. Pig embedding allows us to embed Pig statements in Python and make use of rich Python language features such as loops and branches. Java is no longer the only choice for writing Pig UDFs; we can write UDFs in Python, JavaScript and Ruby. Nested foreach and cross give us more ways to manipulate data that were not possible before. We have also added tons of syntactic sugar to simplify Pig syntax: for example, direct syntax support for maps, tuples and bags, and project-range expressions in foreach. We also revived support for the illustrate command to ease debugging. In this paper, I will give an overview of all these features and illustrate how to use them to program more efficiently in Pig. I will also give concrete examples to demonstrate how the Pig language has evolved over time with these improvements.
This presentation focuses on how Flickr recently brought scalable computer vision and deep learning to its multi-billion-photo collection using Apache Hadoop and Storm.
Video of the presentation is available here: https://www.youtube.com/watch?v=OrhAbZGkW8k
From the May 14, 2014 San Francisco Hadoop User Group meeting. For more information on the SFHUG see http://www.meetup.com/hadoopsf/
If you'd like to work on interesting challenges like this at Flickr in San Francisco, please see http://www.flickr.com/jobs
How LinkedIn Uses Scalding for Data Driven Product Development (Sasha Ovsankin)
The document discusses Scalding, an open source Scala-based DSL for Hadoop development. It describes how LinkedIn uses Scalding for various data processing tasks like generating content, targeting, and email experiences. Key benefits of Scalding include its succinct syntax, native Avro support, and ability to integrate with LinkedIn's development tools and processes. The author provides examples of Scalding code used at LinkedIn and discusses opportunities to improve Scalding further.
Running Intelligent Applications inside a Database: Deep Learning with Python... (Miguel González-Fierro)
In this talk we present a new paradigm of computation where the intelligence is computed inside the database. Standard software systems must get the data out of the database to execute a routine; if the data is large, the data movement introduces inefficiencies. Stored procedures tried to solve this issue in the past by allowing simple functions to be computed inside the database, but only simple routines can be executed.
To showcase the capabilities of our new system, we created a lung cancer detection algorithm using Microsoft’s Cognitive Toolkit, also known as CNTK. We used transfer learning between the ImageNet dataset, which contains natural images, and a lung cancer dataset, which contains scans of horizontal sections of the lung for healthy and sick patients. Specifically, a Convolutional Neural Network pretrained on ImageNet is used on the lung cancer dataset to generate features. Once the features are computed, a boosted tree is applied to predict whether or not the patient has cancer.
All of this is computed inside the database, so data movement is minimized. We are even able to execute the algorithm using the GPU of the virtual machine that hosts the database. Using a GPU, we can compute the featurization in less than 1h, in contrast to a CPU, which would take up to 32h. Finally, we set up an API to connect the solution to a web app, where a doctor can analyze the images and get a prediction for a patient.
Agile Data Science: Hadoop Analytics Applications (Russell Jurney)
This document provides instructions and examples for analyzing and visualizing event data in an agile manner. It discusses loading event data stored in Avro format using tools like Pig and displaying the data in a browser. Specific steps outlined include using Cat to view Avro data, loading the data into Pig and using Illustrate to view sample records. The overall approach emphasized is to work with atomic event data in an iterative way using Pig and other Hadoop tools to explore and visualize the data.
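A hedged sketch of that first inspection step in Python (assuming the fastavro library; the file name is illustrative), analogous to cat-ing the Avro data before loading it into Pig:

```python
# Peek at the schema and the first few Avro records.
from fastavro import reader

with open("events.avro", "rb") as fo:
    avro_reader = reader(fo)
    print(avro_reader.writer_schema)  # the schema embedded in the file
    for i, record in enumerate(avro_reader):
        print(record)                 # one event per record, like Pig's ILLUSTRATE sample
        if i == 4:
            break
```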
The document discusses using a Raspberry Pi, Beacons, and Node.js to broadcast URLs from physical objects. It provides code examples for setting up a Raspberry Pi with Node.js, installing necessary libraries, and programming the device to broadcast URLs using the Eddystone beacon format over Bluetooth Low Energy. The code parses beacon signals to extract URLs and notifies the user of nearby physical web objects.
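The Eddystone-URL frame format itself is simple enough to sketch in a few lines. The encoder below (in Python rather than the talk's Node.js; a sketch of the published spec, not the talk's code) builds the service-data bytes that a BLE library would then advertise:

```python
# Eddystone-URL frame: [0x10, tx power, scheme code, compressed URL (max 17 bytes)].
SCHEMES = ["http://www.", "https://www.", "http://", "https://"]
EXPANSIONS = [".com/", ".org/", ".edu/", ".net/", ".info/", ".biz/", ".gov/",
              ".com", ".org", ".edu", ".net", ".info", ".biz", ".gov"]

def encode_eddystone_url(url: str, tx_power: int = -21) -> bytes:
    for scheme, prefix in enumerate(SCHEMES):
        if url.startswith(prefix):
            body = url[len(prefix):]
            break
    else:
        raise ValueError("URL must start with http(s)://")
    encoded = bytearray()
    i = 0
    while i < len(body):
        for code, text in enumerate(EXPANSIONS):
            if body.startswith(text, i):   # well-known suffixes compress to one byte
                encoded.append(code)
                i += len(text)
                break
        else:
            encoded.append(ord(body[i]))
            i += 1
    if len(encoded) > 17:
        raise ValueError("encoded URL exceeds 17 bytes")
    return bytes([0x10, tx_power & 0xFF, scheme]) + bytes(encoded)

print(encode_eddystone_url("https://www.flickr.com/jobs").hex())
```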
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package) (Jeffrey Breen)
The document describes a Big Data workshop held on March 10, 2012 at the Microsoft New England Research & Development Center in Cambridge, MA. The workshop focused on using R and Hadoop, with an emphasis on RHadoop's rmr package. The document provides an introduction to using R with Hadoop and discusses several R packages for working with Hadoop, including RHIPE, rmr, rhdfs, and rhbase. Code examples are presented demonstrating how to calculate average departure delays by airline and month from an airline on-time performance dataset using different approaches, including Hadoop streaming, hive, RHIPE and rmr.
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR (IMC Institute)
This document outlines an agenda for a hands-on workshop on running Hadoop on public clouds. It describes how to use Cloudera on GoGrid and Amazon EMR to run Hadoop and import/export data. It also covers writing MapReduce programs, packaging and deploying them on Hadoop, and running a sample WordCount job on Amazon EMR.
Eddystone Beacons - Physical Web - Giving a URL to All Objects (Jeff Prestes)
More mobile technologies are empowering people and machines to become more autonomous. In the same way as people, machines need ways to be identified to other sources in a connected environment. This begs the question, why not give a URL to objects? With Eddystone, a new Google specification for Beacon data, this is possible, and it works with both Android and iOS based devices.
With it you can implement what physical-web.org stands for.
This document provides an overview of Apache Pig and Pig Latin for querying large datasets. It discusses why Pig was created due to limitations in SQL for big data, how Pig scripts are written in Pig Latin using a simple syntax, and how Pig Latin scripts are compiled into MapReduce jobs and executed on Hadoop clusters. Advanced topics covered include user-defined functions in Pig Latin for custom data processing and sharing functions through the Piggy Bank.
This document discusses converting Django apps into reusable services. It begins by explaining the typical structure of a Django project with multiple apps. It then discusses some of the challenges with this monolithic structure in terms of reusability, scalability and maintainability.
The document proposes converting apps into reusable services with defined contracts for communication. It provides examples of defining API endpoints and authentication. Converting apps to services allows for improved reusability as apps can be developed and updated independently. It also enables better scalability by removing dependencies between apps.
Cross Domain Web Mashups with JQuery and Google App Engine (Andy McKay)
This document discusses cross-domain mashups using jQuery and Google App Engine. It describes common techniques for dealing with the same-origin policy, including proxies, JSONP, and building sample applications that mashup Twitter data, geotagged tweets, and maps. Examples include parsing RSS feeds from Twitter into JSONP and displaying tweets on a map based on their geotagged locations. The document concludes by noting issues with trust, failures, and limitations for enterprise use.
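The JSONP trick is small enough to show concretely. A minimal sketch in Python/Flask (the talk used App Engine; endpoint and field names here are illustrative) that wraps a JSON payload in the caller-supplied callback so a cross-domain script tag can consume it:

```python
import json
from flask import Flask, request

app = Flask(__name__)

@app.route("/tweets")
def tweets():
    data = [{"user": "andy", "text": "hello"}]  # a real mashup would fetch a feed here
    callback = request.args.get("callback")
    if callback:
        # JSONP: return JavaScript that calls the client's callback with the data,
        # sidestepping the same-origin policy that blocks a plain cross-domain XHR.
        body = f"{callback}({json.dumps(data)})"
        return app.response_class(body, mimetype="application/javascript")
    return app.response_class(json.dumps(data), mimetype="application/json")

if __name__ == "__main__":
    app.run()
```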
The document provides an overview of Hadoop and HDFS. It discusses key concepts such as what big data is, examples of big data, an overview of Hadoop, the core components of HDFS and MapReduce, characteristics of HDFS including fault tolerance and throughput, the roles of the namenode and datanodes, and how data is stored and replicated in blocks in HDFS. It also answers common interview questions about Hadoop and HDFS.
Hadoop and Pig are tools for analyzing large datasets. Hadoop uses MapReduce and HDFS for distributed processing and storage. Pig provides a high-level language for expressing data analysis jobs that are compiled into MapReduce programs. Common tasks like joins, filters, and grouping are built into Pig for easier programming compared to lower-level MapReduce.
This document outlines the schedule and rules for the Thailand Hadoop Big Data Challenge from March 13-15, 2015. It will provide participants with access to an Amazon EMR Hadoop cluster to analyze sample or their own data sets. Submissions will be judged on complexity, social benefit, innovation, and presentation. The winner receives an Apple TV, and two runners-up receive free training courses. Participants can register on-site or online by March 13 to receive cluster access credentials. Communication will occur via Facebook and email.
This document discusses using Amazon Elastic MapReduce (EMR) as a cost-effective solution for processing large amounts of data for a startup. It describes how the startup was struggling to manage their growing data on MySQL databases and traditional Hadoop clusters, which were difficult to maintain. Amazon EMR allows companies to launch Hadoop clusters on demand to process data stored in S3 and pay only for the hours used, providing significant cost savings compared to maintaining a permanent cluster. The author details their experience migrating to this approach using EMR and how it helped them overcome their scaling challenges in a simple and cost-efficient manner.
This workshop is for a "Big Data using Hadoop" course at IMC Institute in March 2015. The workshop is based on Apache Hadoop, using an EC2 server on AWS.
Agile Data Science 2.0 covers the theory and practice of applying agile methods to applied analytics research, also called data science. The book takes the stance that data products are the preferred output format for data science teams to effect change in an organization. Accordingly, we show how to "get meta" to enable agility in building applications describing the applied research process itself. Then we show how to use big data tools to iteratively build, deploy and refine analytics applications. Tracking data-product development through the five stages of the "data value pyramid", we show you how to build applications from conception through development, deployment and iterative improvement. Application development is a fundamental skill for a data scientist, and by publishing your data science work as a web application, we show you how to effect maximal change within your organization.
Technologies covered include Python, Apache Spark (Spark MLlib, Spark Streaming), Apache Kafka, MongoDB, ElasticSearch and Apache Airflow.
This document provides an overview and tutorial on streaming jobs in Hadoop, which allow processing of data using non-Java programs like Python scripts. It includes sample code and datasets to demonstrate joining and counting data from multiple files using mappers and reducers. Tips are provided on optimizing streaming jobs, such as padding fields for sorting, handling errors, and running jobs on Hadoop versus standalone.
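The core pattern is worth seeing concretely. A word-count pair in Python (a generic sketch, not the tutorial's exact code): Hadoop Streaming pipes input lines to the mapper's stdin and feeds the sorted mapper output to the reducer:

```python
#!/usr/bin/env python
# mapper.py -- emit "word<TAB>1" for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- input arrives sorted by key, so counts can be summed per run of equal keys.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```

The same scripts run standalone for debugging, exactly as the tutorial's tips suggest: `cat input.txt | python mapper.py | sort | python reducer.py`.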
Programmers love Python because of how fast and easy it is to use. Python cuts development time in half with its simple-to-read syntax and easy compilation feature. Debugging your programs is a breeze in Python with its built-in debugger. Using Python makes programmers more productive and their programs ultimately better. Python continues to be a favorite option for data scientists, who use it for building machine learning applications and other scientific computations.
Python runs on Windows, Linux/Unix and Mac OS, and has been ported to the Java and .NET virtual machines. Python is free to use, even for commercial products, because of its OSI-approved open source license.
Python has evolved into the most preferred language for data analytics, and the increasing search trends on Python also indicate that it is the next "Big Thing" and a must for professionals in the data analytics domain.
This document provides an overview of a presentation on how to crack the CFA Level 1 exam. The presentation covers an introduction to the CFA Level 1 exam, who should take it, how to prepare, and a sample class on reading cash flow statements. It is presented by Ankur Kulshrestha and Amit Parakh, who have extensive experience in CFA training and corporate finance. The presentation promotes Edureka's upcoming refresher course for the CFA Level 1 exam, which will provide 50 hours of live online classes over 8 weekends to help students prepare.
This document discusses transaction processing and control in Informatica PowerCenter. It explains that a transaction is a set of operations that either all succeed or all fail together to maintain data integrity. It then covers transaction control in PowerCenter, including different commit types (source-based, target-based, user-defined), transaction control transformations, and related session properties. The document concludes with an overview of hands-on exercises for configuring commit points and learning mapping tips and tricks.
This document compares OpenStack and Amazon Web Services (AWS) cloud platforms. It discusses the key services each provides such as compute, storage, networking, orchestration, and user interfaces. The document also examines the business characteristics of each like service level agreements (SLAs), data ownership, ecosystems, and pricing models. It provides examples of use cases for each platform and discusses how to manage an open hybrid cloud that utilizes both OpenStack and AWS together through a cloud management platform.
Shawn Hermans gave a presentation on using Pig and Python for big data analysis. He began with an introduction and background about himself. The presentation covered what big data is, challenges in working with large datasets, and how tools like MapReduce, Pig and Python can help address these challenges. As examples, he demonstrated using Pig to analyze US Census data and propagate satellite positions from Two Line Element sets. He also showed how to extend Pig with Python user defined functions.
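As a sketch of that Pig-plus-Python extension point (assuming Pig's Jython UDF support, where the outputSchema decorator is injected at registration time; names are illustrative, not from the talk):

```python
# udfs.py -- registered from Pig Latin with:
#   REGISTER 'udfs.py' USING jython AS myfuncs;
#   cleaned = FOREACH people GENERATE myfuncs.normalize_name(name);

@outputSchema("name:chararray")
def normalize_name(name):
    # Trim whitespace and title-case a raw name field.
    if name is None:
        return None
    return name.strip().title()
```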
This document provides an introduction to OpenStack, an open-source cloud operating system. It begins by defining OpenStack as a set of tools for building public or private clouds, then describes its key components like compute, storage, networking, and identity services. The document compares OpenStack to AWS, noting similarities and differences in their virtual machine, storage, networking, security, and orchestration services. It also outlines when OpenStack may be better suited than AWS, such as for organizations with specific hardware requirements. Finally, it briefly introduces DevOps tools like Chef, Puppet, and Ansible that can be used to automate operations on OpenStack clouds.
This document provides an overview and objectives of a Python course for big data analytics. It discusses why Python is well-suited for big data tasks due to its libraries like PyDoop and SciPy. The course includes demonstrations of web scraping using Beautiful Soup, collecting tweets using APIs, and running word count on Hadoop using Pydoop. It also discusses how Python supports key aspects of data science like accessing, analyzing, and visualizing large datasets.
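A minimal version of the scraping demonstration (a sketch with a placeholder URL, assuming the requests and beautifulsoup4 packages):

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")

# Extract the text and target of every link on the page.
for a in soup.find_all("a"):
    print(a.get_text(strip=True), a.get("href"))
```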
Splunk Tutorial for Beginners - What is Splunk | Edureka (Edureka!)
The document discusses Splunk, a software platform used for searching, analyzing, and visualizing machine-generated data. It provides an example use case of Domino's Pizza using Splunk to gain insights from data from various systems like mobile orders, website orders, and offline orders. This helped Domino's track the impact of various promotions, compare performance metrics, and analyze factors like payment methods. The document also outlines Splunk's components like forwarders, indexers, and search heads and how they allow users to index, store, search and visualize data.
Python in the Hadoop Ecosystem (Rock Health presentation) (Uri Laserson)
A presentation covering the use of Python frameworks on the Hadoop ecosystem. Covers, in particular, Hadoop Streaming, mrjob, luigi, PySpark, and using Numba with Impala.
What's hot in APIs? Here are 8 big trends in open APIs today. This SXSW 2012 talk covers monetization trends, technology trends and what makes developers love an API (hint: it's not stale documentation). These are drawn from our data and trends we're seeing at ProgrammableWeb.
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo... (Edureka!)
This Hadoop tutorial on interview questions and answers (Hadoop interview blog series: https://goo.gl/ndqlss) will help you prepare for Big Data and Hadoop interviews. Learn about the most important Hadoop interview questions and answers, and know what will set you apart in the interview process. Below are the topics covered in this tutorial:
Hadoop Interview Questions on:
1) Big Data & Hadoop
2) HDFS
3) MapReduce
4) Apache Hive
5) Apache Pig
6) Apache HBase and Sqoop
Check our complete Hadoop playlist here: https://goo.gl/4OyoTW
The document discusses using Python with Hadoop frameworks. It introduces Hadoop Distributed File System (HDFS) and MapReduce, and how to use the mrjob library to write MapReduce jobs in Python. It also covers using Python with higher-level Hadoop frameworks like Pig, accessing HDFS with snakebite, and using Python clients for HBase and the PySpark API for the Spark framework. Key advantages discussed are Python's rich ecosystem and ability to access Hadoop frameworks.
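The mrjob library in particular collapses the mapper/reducer pair into a single class. A minimal sketch (a generic word count, not the document's exact example):

```python
# wordcount_mr.py -- run locally with `python wordcount_mr.py input.txt`,
# or on a cluster with `-r hadoop`.
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```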
The document discusses logging and analytics for software products. It notes that logging only essential information is important for understanding current issues based on past events. Effective logging requires balancing what to log against actually implementing the logging. The document proposes logging all events in a single, standardized format that can be queried across systems, yielding insights that are otherwise difficult to obtain from separate logging mechanisms.
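One way to approximate such a single standardized format (a sketch, not the document's exact schema) is to emit every event as one JSON object per line, which any downstream system can parse and query:

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line for cross-system querying."""
    def format(self, record):
        return json.dumps({
            "ts": time.time(),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("user_signup")  # -> {"ts": ..., "level": "INFO", "logger": "app", "event": "user_signup"}
```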
Agile Data: Building Hadoop Analytics Applications (DataWorks Summit)
This document provides an overview of steps to build an agile analytics application, beginning with raw event data and ending with a web application to explore and visualize that data. The steps include:
1) Serializing raw event data (emails, logs, etc.) into a document format like Avro or JSON
2) Loading the serialized data into Pig for exploration and transformation
3) Publishing the data to a "database" like MongoDB
4) Building a web interface with tools like Sinatra, Bootstrap, and JavaScript to display and link individual records
The overall approach emphasizes rapid iteration, with the goal of creating an application that allows continuous discovery of insights from the source data.
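A minimal sketch of steps 3 and 4 (assuming a local MongoDB; collection and field names are illustrative, and Flask stands in for the Sinatra layer the document mentions):

```python
from flask import Flask, jsonify
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["agile"]

# Step 3: publish a parsed event document to the "database".
db.emails.insert_one({"from": "alice@example.com", "subject": "hello"})

app = Flask(__name__)

@app.route("/emails")
def emails():
    # Step 4: expose records so a Bootstrap/JavaScript front end can render and link them.
    docs = list(db.emails.find({}, {"_id": 0}).limit(20))
    return jsonify(docs)

if __name__ == "__main__":
    app.run()
```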
This document provides an overview of Apache Hadoop, including what it is, how it works using MapReduce, and when it may be a good solution. Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity servers. It allows for the parallel processing of large datasets in a reliable, fault-tolerant manner. The document discusses how Hadoop is used by many large companies, how it works based on the MapReduce paradigm, and recommends Hadoop for problems involving big data that can be modeled with MapReduce.
The potential problem with caching in update_homepage is that deleting the cache key after updating the page could lead to a race condition or stampede.
Since the homepage is being hit 1000/sec, between the time the cache key is deleted and a new value is set, many requests could hit the database simultaneously to refetch the page, overwhelming it.
It would be better to set a new value for the cache key instead of deleting it, avoiding this issue; a minimal sketch follows.
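A minimal sketch of that fix, assuming a memcached-style client and a placeholder render function:

```python
CACHE_TTL = 60  # seconds

def render_homepage(db):
    return "<html>homepage</html>"  # stands in for the expensive query + render

def update_homepage(cache, db):
    # Overwrite the cached copy instead of deleting the key: readers never
    # observe a missing key, so a burst of requests cannot stampede the
    # database to refill the cache.
    cache.set("homepage", render_homepage(db), CACHE_TTL)

def get_homepage(cache, db):
    html = cache.get("homepage")
    if html is None:  # only on cold start or natural expiry
        html = render_homepage(db)
        cache.set("homepage", html, CACHE_TTL)
    return html
```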
Agile Data Science: Building Hadoop Analytics Applications (Russell Jurney)
This document discusses building agile analytics applications with Hadoop. It outlines several principles for developing data science teams and applications in an agile manner. Some key points include:
- Data science teams should be small, around 3-4 people with diverse skills who can work collaboratively.
- Insights should be discovered through an iterative process of exploring data in an interactive web application, rather than trying to predict outcomes upfront.
- The application should start as a tool for exploring data and discovering insights, which then becomes the palette for what is shipped.
- Data should be stored in a document format like Avro or JSON rather than a relational format, to reduce joins and better represent semi-structured data.
The presentation belonging to the ALTEN Playground of November 2, 2017. More information on this playground can be found here: https://www.gitbook.com/book/matthijsmali/user-metrics/
The document provides tips for improving performance at Slack, including:
1. Monitoring performance to track existing and new codepaths and overall application performance.
2. Adding rate limits to APIs and jobs to prevent servers from overloading and spamming (a token-bucket sketch follows this list).
3. Paginating large lists to return data in chunks and reduce load.
4. Caching frequently accessed and expensive to calculate data locally, remotely, and in databases.
5. Making tasks asynchronous by splitting them into subtasks processed in parallel or via job queues.
6. Having a tiered degradation plan to determine priority of features during outages.
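For tip 2, a token bucket is the usual building block; a self-contained sketch (not Slack's implementation):

```python
import time

class TokenBucket:
    """Allow `rate` calls per second with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
print(sum(bucket.allow() for _ in range(12)))  # at most 10 allowed in an instant burst
```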
Docker Logging and analysing with Elastic Stack (Jakub Hajek)
Collecting logs from an entirely stateless environment is one of the challenging parts of the application lifecycle. Correlating business logs with operating system metrics to provide insights is crucial for the entire organization. What aspects should be considered while you design your logging solution?
Docker Logging and analysing with Elastic Stack - Jakub Hajek (PROIDEA)
Collecting logs from an entirely stateless environment is one of the challenging parts of the application lifecycle. Correlating business logs with operating system metrics to provide insights is crucial for the entire organization. We will see a technical presentation on how to manage a large amount of data in a typical environment with microservices.
The inherent complexity of stream processing (nathanmarz)
The document discusses approaches for computing unique visitors to web pages over time ranges while dealing with changing user ID mappings.
Initially, three approaches are presented using a key-value store: storing user IDs in sets indexed by URL and hour bucket (Approach 1), using HyperLogLogs for more efficient storage (Approach 2), and storing at multiple granularities to reduce lookups (Approach 3).
The problem is made harder by the presence of "equiv" records that map one user ID to another. Later approaches try to incrementally normalize user IDs, sample user IDs, or maintain separate indexes.
Ultimately, a hybrid approach is proposed using batch computation over the entire dataset to build robust indexes,
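Approach 2's HyperLogLog is the key space-saver: a fixed-size, mergeable sketch per (URL, hour) bucket. A compact illustration of the data structure itself (a textbook sketch, not the talk's code):

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, b: int = 10):
        self.b, self.m = b, 1 << b
        self.registers = [0] * self.m

    def add(self, item: str):
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.b)                      # first b bits pick a register
        rest = h & ((1 << (64 - self.b)) - 1)
        rank = (64 - self.b) - rest.bit_length() + 1  # position of the first 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HyperLogLog"):
        # Union of two sketches -- this is what makes multi-hour range queries cheap.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:             # small-range correction
            est = self.m * math.log(self.m / zeros)
        return est

hll = HyperLogLog()
for uid in range(10000):
    hll.add(f"user-{uid}")
print(round(hll.count()))  # within a few percent of 10000, using ~1 KB of registers
```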
Agile Data Science by Russell Jurney, The Hive, January 29 2014 (The Hive)
This document discusses setting up an environment for agile data science and analytics applications. It recommends:
- Publishing atomic records like emails or logs to a "database" like MongoDB in order to make the data accessible to designers, developers and product managers.
- Wrapping the records with tools like Pig, Avro and Bootstrap to enable viewing, sorting and linking the records in a browser.
- Taking an iterative approach of refining the data model and publishing insights to gradually build up an application that discovers insights from exploring the data, rather than designing insights upfront.
- Emphasizing simplicity, self-service tools, and minimizing impedance between layers to facilitate rapid iteration and collaboration across roles.
In many database applications we first log data and then, a few hours or days later, we start analyzing it. But in a world that’s moving faster and faster, we sometimes need to analyze what is happening NOW.
Azure Stream Analytics allows you to analyze streams of data via a new Azure service. In this session you will see how to get started using this new service. From Event Hubs on the input side to temporal SQL queries, the demos in this session will show you end to end how to get started with Azure Stream Analytics.
ElasticSearch is a flexible and powerful open source, distributed real-time search and analytics engine for the cloud. It is JSON-oriented, uses a RESTful API, and has a schema-free design. Logstash is a tool for collecting, parsing, and storing logs and events in ElasticSearch for later use and analysis. It has many input, filter, and output plugins to collect data from various sources, parse it, and send it to destinations like ElasticSearch. Kibana works with ElasticSearch to visualize and explore stored logs and data.
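The JSON/REST orientation is easy to demonstrate against a local node (a rough sketch assuming a modern Elasticsearch; index and field names are illustrative):

```python
import requests

base = "http://localhost:9200"

# Index a log event as a JSON document -- schema-free, fields are mapped on the fly.
# `refresh=true` makes the document searchable immediately, for demo purposes.
doc = {"@timestamp": "2015-03-10T12:00:00Z", "level": "ERROR", "message": "disk full"}
requests.post(f"{base}/logs/_doc?refresh=true", json=doc).raise_for_status()

# Full-text search over indexed documents, the same kind of query Kibana issues.
q = {"query": {"match": {"message": "disk"}}}
hits = requests.get(f"{base}/logs/_search", json=q).json()["hits"]["hits"]
for hit in hits:
    print(hit["_source"])
```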
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B... (Dirk Petersen)
Dirk Petersen, Scientific Computing Manager, Fred Hutchinson Cancer Research Center (FHCRC)
Joe Arnold, President and Chief Product Officer, SwiftStack
Considering deploying a multi-petabyte storage-as-a-service offering in your research environment? Learn how an industry-leading software-defined object storage solution, architected by SwiftStack and Silicon Mechanics, helped shift hundreds of users to an object-based workflow for their archival data. With an emphasis on cost efficiencies, scalability, and manageability, see how this implementation at Fred Hutchinson Cancer Research Center (FHCRC) is continually evolving across new use cases and access methods.
This document provides an introduction to using Hive on Hadoop at LinkedIn. It begins by discussing the target audience and providing an overview of key Hadoop concepts like HDFS and MapReduce. It then explains that Hive allows SQL-like queries on Hadoop datasets and that most relational algebra can be expressed with maps and reduces. The document outlines LinkedIn's Hive databases and demonstrates basic Hive queries, joins, views, user-defined functions, and best practices. It also provides troubleshooting tips and resources for learning more about Hive.
Using Metrics for Fun, Developing with the KV Store + Javascript & News from ... (Harry McLaren)
We explore "Metrics, mstats and Me: Splunking Human Data” and also have some insights into the KV Store and javascript use in dashboards. We’ll also re-cover the conf18 updates for those who couldn’t attend our last session.
This document provides an overview of key topics for database training for developers, including database design, creating databases and database objects, writing efficient stored procedures, and optimizing queries. It discusses modeling data through entity relationship diagrams, creating tables, views, indexes and stored procedures. It also offers tips for writing well-structured stored procedures and optimizing query performance through execution plans and tuning techniques.
The document provides an overview of Hydra, an open source distributed data processing system. It discusses Hydra's goals of supporting streaming and batch processing at massive scale with fault tolerance. It also covers key Hydra concepts like jobs, tasks, and nodes. The document then demonstrates setting up a local Hydra development environment and creating a sample job to analyze log data and find top search terms.
Essentials of Automations: The Art of Triggers and Actions in FME (Safe Software)
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
E-commerce Development Services (Hornet Dynamics)
For any business hoping to succeed in the digital age, having a strong online presence is crucial. We offer Ecommerce Development Services that are customized according to your business requirements and client preferences, enabling you to create a dynamic, safe, and user-friendly online store.
Graspan: A Big Data System for Big Code Analysis (Aftab Hussain)
We built a disk-based parallel graph system, Graspan, that uses a novel edge-pair centric computation model to compute dynamic transitive closures on very large program graphs.
We implement context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases such as Linux shows that their Graspan implementations scale to millions of lines of code and are much simpler than their original implementations.
These analyses were used to augment the existing checkers; these augmented checkers found 132 new NULL pointer bugs and 1308 unnecessary NULL tests in Linux 4.4.0-rc5, PostgreSQL 8.3.9, and Apache httpd 2.2.18.
- Accepted in ASPLOS ‘17, Xi’an, China.
- Featured in the tutorial, Systemized Program Analyses: A Big Data Perspective on Static Analysis Scalability, ASPLOS ‘17.
- Invited for presentation at SoCal PLS ‘16.
- Invited for poster presentation at PLDI SRC ‘16.
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code (Aftab Hussain)
Understanding variable roles in code has been found to be helpful to students learning programming -- could variable roles also help deep neural models perform coding tasks? We do an exploratory study.
- These are slides of the talk given at InteNSE'23: The 1st International Workshop on Interpretability and Robustness in Neural Software Engineering, co-located with the 45th International Conference on Software Engineering, ICSE 2023, Melbourne Australia
What is Augmented Reality Image Tracking (pavan998932)
Augmented Reality (AR) Image Tracking is a technology that enables AR applications to recognize and track images in the real world, overlaying digital content onto them. This enhances the user's interaction with their environment by providing additional information and interactive elements directly tied to physical images.
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions (Peter Muessig)
The UI5 tooling is the development and build tooling of UI5. It is built in a modular and extensible way so that it can be easily extended by your needs. This session will showcase various tooling extensions which can boost your development experience by far so that you can really work offline, transpile your code in your project to use even newer versions of EcmaScript (than 2022 which is supported right now by the UI5 tooling), consume any npm package of your choice in your project, using different kind of proxies, and even stitching UI5 projects during development together to mimic your target environment.
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail... (kalichargn70th171)
A dynamic process unfolds in the intricate realm of software development, dedicated to crafting and sustaining products that effortlessly address user needs. Amidst vital stages like market analysis and requirement assessments, the heart of software development lies in the meticulous creation and upkeep of source code. Code alterations are inherent, challenging code quality, particularly under stringent deadlines.
DDS Security Version 1.2 was adopted in 2024. This revision strengthens support for long-running systems, adding new cryptographic algorithms, certificate revocation, and hardening against DoS attacks.
Artificial Intelligence and XPath Extension Functions (Octavian Nadolu)
The purpose of this presentation is to provide an overview of how you can use AI from XSLT, XQuery, Schematron, or XML Refactoring operations, the potential benefits of using AI, and some of the challenges we face.
Utilocate offers a comprehensive solution for locate ticket management by automating and streamlining the entire process. By integrating with Geospatial Information Systems (GIS), it provides accurate mapping and visualization of utility locations, enhancing decision-making and reducing the risk of errors. The system's advanced data analytics tools help identify trends, predict potential issues, and optimize resource allocation, making the locate ticket management process smarter and more efficient. Additionally, automated ticket management ensures consistency and reduces human error, while real-time notifications keep all relevant personnel informed and ready to respond promptly.
The system's ability to streamline workflows and automate ticket routing significantly reduces the time taken to process each ticket, making the process faster and more efficient. Mobile access allows field technicians to update ticket information on the go, ensuring that the latest information is always available and accelerating the locate process. Overall, Utilocate not only enhances the efficiency and accuracy of locate ticket management but also improves safety by minimizing the risk of utility damage through precise and timely locates.
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Łukasz Chruściel
No one wants their application to drag like a car stuck in the slow lane! Yet it’s all too common to encounter bumpy, pothole-filled solutions that slow the speed of any application. Symfony apps are not an exception.
In this talk, I will take you for a spin around the performance racetrack. We’ll explore common pitfalls - those hidden potholes on your application that can cause unexpected slowdowns. Learn how to spot these performance bumps early, and more importantly, how to navigate around them to keep your application running at top speed.
We will focus in particular on tuning your engine at the application level, making the right adjustments to ensure that your system responds like a well-oiled, high-performance race car.
WhatsApp offers simple, reliable, and private messaging and calling services for free worldwide. With end-to-end encryption, your personal messages and calls are secure, ensuring only you and the recipient can access them. Enjoy voice and video calls to stay connected with loved ones or colleagues. Express yourself using stickers, GIFs, or by sharing moments on Status. WhatsApp Business enables global customer outreach, facilitating sales growth and relationship building through showcasing products and services. Stay connected effortlessly with group chats for planning outings with friends or staying updated on family conversations.
Atelier - Innover avec l’IA Générative et les graphes de connaissancesNeo4j
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Allez au-delà du battage médiatique autour de l’IA et découvrez des techniques pratiques pour utiliser l’IA de manière responsable à travers les données de votre organisation. Explorez comment utiliser les graphes de connaissances pour augmenter la précision, la transparence et la capacité d’explication dans les systèmes d’IA générative. Vous partirez avec une expérience pratique combinant les relations entre les données et les LLM pour apporter du contexte spécifique à votre domaine et améliorer votre raisonnement.
Amenez votre ordinateur portable et nous vous guiderons sur la mise en place de votre propre pile d’IA générative, en vous fournissant des exemples pratiques et codés pour démarrer en quelques minutes.
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxrickgrimesss22
Discover the essential features to incorporate in your Winzo clone app to boost business growth, enhance user engagement, and drive revenue. Learn how to create a compelling gaming experience that stands out in the competitive market.
SOCRadar's Aviation Industry Q1 Incident Report is out now!
The aviation industry has always been a prime target for cybercriminals due to its critical infrastructure and high stakes. In the first quarter of 2024, the sector faced an alarming surge in cybersecurity threats, revealing its vulnerabilities and the relentless sophistication of cyber attackers.
SOCRadar’s Aviation Industry, Quarterly Incident Report, provides an in-depth analysis of these threats, detected and examined through our extensive monitoring of hacker forums, Telegram channels, and dark web platforms.
Enterprise Resource Planning System includes various modules that reduce any business's workload. Additionally, it organizes the workflows, which drives towards enhancing productivity. Here are a detailed explanation of the ERP modules. Going through the points will help you understand how the software is changing the work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
3. Apache DataFu
• Apache DataFu is a collection of libraries for working with
large-scale data in Hadoop.
• Currently consists of two libraries:
• DataFu Pig – a collection of Pig UDFs
• DataFu Hourglass – incremental processing
• Incubating
4. History
• LinkedIn had a number of teams who had developed
generally useful UDFs
• Problems:
• No centralized library
• No automated testing
• Solutions:
• Unit tests (PigUnit)
• Code coverage (Cobertura)
• Initially open-sourced 2011; 1.0 September, 2013
5. What it’s all about
• Making it easier to work with large scale data
• Well-documented, well-tested code
• Easy to contribute
• Extensive documentation
• Getting started guide
• e.g. for DataFu Pig, it should be easy to add a UDF,
add a test, and ship it
6. DataFu community
• People who use Hadoop for working with data
• Used extensively at LinkedIn
• Included in Cloudera’s CDH
• Included in Apache Bigtop
8. DataFu Pig
• A collection of UDFs for data analysis covering:
• Statistics
• Bag Operations
• Set Operations
• Sessions
• Sampling
• General Utility
• And more…
9. Coalesce
• A common case: replace null values with a default
data = FOREACH data GENERATE (val IS NOT NULL ? val : 0) as result;
• To return the first non-null value:
data = FOREACH data GENERATE (val1 IS NOT NULL ? val1 :
    (val2 IS NOT NULL ? val2 :
    (val3 IS NOT NULL ? val3 :
    NULL))) as result;
10. Coalesce
• Using Coalesce to set a default of zero
• It returns the first non-null value
data = FOREACH data GENERATE Coalesce(val,0) as result;
data = FOREACH data GENERATE Coalesce(val1,val2,val3) as result;
11. Compute session statistics
• Suppose we have a website, and we want to see how
long members spend browsing it
• We also want to know who are the most engaged
• Raw data is the click stream
pv = LOAD 'pageviews.csv' USING PigStorage(',')
    AS (memberId:int, time:long, url:chararray);
12. Compute session statistics
• First, what is a session?
• Session = sustained user activity
• Session ends after 10 minutes of no activity
DEFINE Sessionize datafu.pig.sessions.Sessionize('10m');
DEFINE UnixToISO
    org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO();
• Sessionize expects ISO-formatted time
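• The sessionized code on the next slide refers to an isoTime field; a minimal sketch of the conversion, using the UnixToISO UDF defined above (field order assumed):
pv = FOREACH pv GENERATE UnixToISO(time) AS isoTime, time, memberId;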
13. Compute session statistics
• Sessionize appends a sessionId to each tuple
• All tuples in the same session get the same sessionId
pv_sessionized = FOREACH (GROUP pv BY memberId) {
    ordered = ORDER pv BY isoTime;
    GENERATE FLATTEN(Sessionize(ordered))
        AS (isoTime, time, memberId, sessionId);
};

pv_sessionized = FOREACH pv_sessionized GENERATE
    sessionId, memberId, time;
14. Compute session statistics
• Statistics:
DEFINE Median datafu.pig.stats.StreamingMedian();
DEFINE Quantile datafu.pig.stats.StreamingQuantile('0.90','0.95');
DEFINE VAR datafu.pig.stats.VAR();
• You can choose between streaming (approximate) calculations
and exact ones, which are slower and require sorted input
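• For reference, a sketch of the exact counterparts, assuming the standard DataFu class names:
DEFINE Median datafu.pig.stats.Median();
DEFINE Quantile datafu.pig.stats.Quantile('0.90','0.95');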
15. Compute session statistics
• Compute the session length in minutes
session_times =
    FOREACH (GROUP pv_sessionized BY (sessionId,memberId))
    GENERATE group.sessionId as sessionId,
        group.memberId as memberId,
        (MAX(pv_sessionized.time) -
         MIN(pv_sessionized.time))
        / 1000.0 / 60.0 as session_length;
16. Compute session statistics
• Compute the statistics
session_stats = FOREACH (GROUP session_times ALL) {
    ordered = ORDER session_times BY session_length;
    GENERATE
        AVG(ordered.session_length) as avg_session,
        SQRT(VAR(ordered.session_length)) as std_dev_session,
        Median(ordered.session_length) as median_session,
        Quantile(ordered.session_length) as quantiles_session;
};
17. Compute session statistics
• Find the most engaged users
long_sessions =
    FILTER session_times BY
        session_length >
        session_stats.quantiles_session.quantile_0_95;

very_engaged_users =
    DISTINCT (FOREACH long_sessions GENERATE memberId);
18. Pig Bags
• Pig represents collections as a bag
• In Pig Latin, the ways in which you can manipulate a bag
are limited
• Working with an inner bag (inside a nested block) can be
difficult
19. DataFu Pig Bags
• DataFu provides a number of operations to let you
transform bags
• AppendToBag – add a tuple to the end of a bag
• PrependToBag – add a tuple to the front of a bag
• BagConcat – combine two (or more) bags into one
• BagSplit – split one bag into multiple bags
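• As a small illustrative sketch (the relation, bag, and field names here are hypothetical):
DEFINE BagConcat datafu.pig.bags.BagConcat();
-- bag1 = {(1),(2)}, bag2 = {(3),(4)}  =>  combined = {(1),(2),(3),(4)}
combined = FOREACH data GENERATE BagConcat(bag1, bag2);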
20. DataFu Pig Bags
• It also provides UDFs that let you operate on bags similar
to how you might with relations
• BagGroup – group operation on a bag
• CountEach – count how many times a tuple appears
• BagLeftOuterJoin – join tuples in bags by key
21. Counting Events
• Let’s consider a system where a user is recommended
items of certain categories and can act to accept or reject
these recommendations
impressions = LOAD '$impressions' AS (user_id:int, item_id:int,
    timestamp:long);
accepts = LOAD '$accepts' AS (user_id:int, item_id:int, timestamp:long);
rejects = LOAD '$rejects' AS (user_id:int, item_id:int, timestamp:long);
22. Counting Events
• We want to know, for each user, how many times an item
was shown, accepted and rejected
features: {
    user_id:int,
    items:{(
        item_id:int,
        impression_count:int,
        accept_count:int,
        reject_count:int)}
}
23. Counting Events
-- First cogroup
features_grouped = COGROUP
    impressions BY (user_id, item_id),
    accepts BY (user_id, item_id),
    rejects BY (user_id, item_id);
-- Then count
features_counted = FOREACH features_grouped GENERATE
    FLATTEN(group) as (user_id, item_id),
    COUNT_STAR(impressions) as impression_count,
    COUNT_STAR(accepts) as accept_count,
    COUNT_STAR(rejects) as reject_count;
-- Then group again
features = FOREACH (GROUP features_counted BY user_id) GENERATE
    group as user_id,
    features_counted.(item_id, impression_count, accept_count, reject_count)
        as items;
One approach…
24. Counting Events
• But it seems wasteful to have to group twice
• Even big data can get reasonably small once you start
slicing and dicing it
• Want to consider one user at a time – that should be small
enough to fit into memory
25. Counting Events
• Another approach: Only group once
• Use bag-manipulation UDFs to avoid the extra MapReduce job
DEFINE CountEach datafu.pig.bags.CountEach('flatten');
DEFINE BagLeftOuterJoin datafu.pig.bags.BagLeftOuterJoin();
DEFINE Coalesce datafu.pig.util.Coalesce();
• CountEach – counts how many times a tuple appears in a
bag
• BagLeftOuterJoin – performs a left outer join across
multiple bags
26. Counting Events
features_grouped = COGROUP impressions BY user_id, accepts BY user_id,
    rejects BY user_id;

features_counted = FOREACH features_grouped GENERATE
    group as user_id,
    CountEach(impressions.item_id) as impressions,
    CountEach(accepts.item_id) as accepts,
    CountEach(rejects.item_id) as rejects;

features_joined = FOREACH features_counted GENERATE
    user_id,
    BagLeftOuterJoin(
        impressions, 'item_id',
        accepts, 'item_id',
        rejects, 'item_id'
    ) as items;
A DataFu approach…
27. Counting Events
features = FOREACH features_joined {
    projected = FOREACH items GENERATE
        impressions::item_id as item_id,
        impressions::count as impression_count,
        Coalesce(accepts::count, 0) as accept_count,
        Coalesce(rejects::count, 0) as reject_count;
    GENERATE user_id, projected as items;
};
• Coalesce returns here, supplying default values of zero
28. Sampling
• Suppose we only wanted to run our script on a sample of
the previous input data
• We have a problem: the COGROUP only works if the sample
keeps the same keys (user_id) in each relation
impressions = LOAD '$impressions' AS (user_id:int, item_id:int,
    item_category:int, timestamp:long);
accepts = LOAD '$accepts' AS (user_id:int, item_id:int, timestamp:long);
rejects = LOAD '$rejects' AS (user_id:int, item_id:int, timestamp:long);
29. Sampling
• DataFu provides SampleByKey, which accepts or rejects each record
based on a hash of the key and a salt, so the same user_ids
survive the filter in every relation
DEFINE SampleByKey datafu.pig.sampling.SampleByKey('a_salt','0.01');

impressions = FILTER impressions BY SampleByKey('user_id');
accepts = FILTER accepts BY SampleByKey('user_id');
rejects = FILTER rejects BY SampleByKey('user_id');
features = FILTER features BY SampleByKey('user_id');
30. Left outer joins
• Suppose we had three relations:
• And we wanted to do a left outer join on all three:
input1 = LOAD 'input1' using PigStorage(',') AS (key:INT,val:INT);
input2 = LOAD 'input2' using PigStorage(',') AS (key:INT,val:INT);
input3 = LOAD 'input3' using PigStorage(',') AS (key:INT,val:INT);
joined = JOIN input1 BY key LEFT,
    input2 BY key,
    input3 BY key;

• Unfortunately, this is not legal Pig Latin
31. Left outer joins
• Instead, you need to join twice:
data1 = JOIN input1 BY key LEFT, input2 BY key;
data2 = JOIN data1 BY input1::key LEFT, input3 BY key;
• This approach requires two MapReduce jobs, making it
inefficient, as well as inelegant
32. Left outer joins
• There is always cogroup:
data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
data2 = FOREACH data1 GENERATE
    FLATTEN(input1), -- left join on this
    FLATTEN((IsEmpty(input2) ? TOBAG(TOTUPLE((int)null,(int)null)) : input2))
        as (input2::key,input2::val),
    FLATTEN((IsEmpty(input3) ? TOBAG(TOTUPLE((int)null,(int)null)) : input3))
        as (input3::key,input3::val);
• But, it’s cumbersome and error-prone
33. Left outer joins
• So, we have EmptyBagToNullFields
• Cleaner, easier to use
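• EmptyBagToNullFields needs to be defined first; assuming the standard DataFu package:
DEFINE EmptyBagToNullFields datafu.pig.bags.EmptyBagToNullFields();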
data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
data2 = FOREACH data1 GENERATE
    FLATTEN(input1), -- left join on this
    FLATTEN(EmptyBagToNullFields(input2)),
    FLATTEN(EmptyBagToNullFields(input3));
34. Left outer joins
• Can turn it into a macro
DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3)
returns joined {
    cogrouped = COGROUP
        $relation1 BY $key1, $relation2 BY $key2, $relation3 BY $key3;
    $joined = FOREACH cogrouped GENERATE
        FLATTEN($relation1),
        FLATTEN(EmptyBagToNullFields($relation2)),
        FLATTEN(EmptyBagToNullFields($relation3));
};
features = left_outer_join(input1, key, input2, key, input3, key);
35. Schema and aliases
• A common (bad) practice in Pig is to use positional notation
to reference fields
• Hard to maintain
• Script is tightly coupled to order of fields in input
• Inserting a field in the beginning breaks things
downstream
• UDFs can have this same problem
• Especially problematic because code is separated, so
the dependency is not obvious
36. Schema and aliases
• Suppose we are calculating monthly mortgage payments
for various interest rates
mortgage = load 'mortgage.csv' using PigStorage('|')
    as (principal:double,
        num_payments:int,
        interest_rates: bag {tuple(interest_rate:double)});
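• A sample input row might look like this (values are made up):
200000|360|{(0.035),(0.04),(0.045)}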
37. Schema and aliases
• So, we write a UDF to compute the payments
• First, we need to get the input parameters:
@Override
public DataBag exec(Tuple input) throws IOException
{
    Double principal = (Double)input.get(0);
    Integer numPayments = (Integer)input.get(1);
    DataBag interestRates = (DataBag)input.get(2);

    // ...
38. Schema and aliases
• Then do some computation:
DataBag output = bagFactory.newDefaultBag();

for (Tuple interestTuple : interestRates) {
    Double interest = (Double)interestTuple.get(0);

    double monthlyPayment = computeMonthlyPayment(principal, numPayments,
        interest);

    output.add(tupleFactory.newTuple(monthlyPayment));
}
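• The slides don't show computeMonthlyPayment; a sketch using the standard fixed-rate annuity formula, assuming interest_rate is an annual rate:
private double computeMonthlyPayment(double principal, int numPayments,
                                     double annualRate)
{
    double r = annualRate / 12.0; // monthly interest rate
    // M = P * r * (1+r)^n / ((1+r)^n - 1)
    return principal * r * Math.pow(1 + r, numPayments)
        / (Math.pow(1 + r, numPayments) - 1);
}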
39. Schema and aliases
• The UDF then gets applied:
payments = FOREACH mortgage GENERATE MortgagePayment($0,$1,$2);
• Or, a bit more understandably:
payments = FOREACH mortgage GENERATE
    MortgagePayment(principal,num_payments,interest_rates);
40. Schema and aliases
• Later, the data changes: a field is prepended to the
tuples in the interest_rates bag
mortgage = load 'mortgage.csv' using PigStorage('|')
    as (principal:double,
        num_payments:int,
        interest_rates: bag {tuple(wow_change:double,interest_rate:double)});
• The script happily continues to run, silently computing payments
from wow_change instead of interest_rate, and the bad output
flows downstream, causing serious errors later
41. Schema and aliases
• Write the UDF to fetch arguments by name using the
schema
• AliasableEvalFunc can help
Double principal = getDouble(input,"principal");
Integer numPayments = getInteger(input,"num_payments");
DataBag interestRates = getBag(input,"interest_rates");
for (Tuple interestTuple : interestRates) {
    Double interest = getDouble(interestTuple,
        getPrefixedAliasName("interest_rates", "interest_rate"));
    // compute monthly payment...
}
42. Other awesome things
• New and coming things
• Functions for calculating entropy
• OpenNLP wrappers
• New and improved Sampling UDFs
• Additional Bag UDFs
• InHashSet
• More…
44. Event Collection
• Typically online websites have instrumented
services that collect events
• Events stored in an offline system (such as
Hadoop) for later analysis
• Using events, we can build dashboards with
metrics such as:
• # of page views over last month
• # of active users over last month
• Metrics derived from events can also be useful
in recommendation pipelines
• e.g. impression discounting
45. Event Storage
• Events can be categorized into topics, for example:
• page view
• user login
• ad impression/click
• Store events by topic and by day:
• /data/page_view/daily/2013/10/08
• /data/page_view/daily/2013/10/09
• Hourglass allows you to perform computation over specific time
windows of data stored in this format
46. Computation Over Time Windows
• In practice, many computations over time windows use either:
• a fixed-start window, whose start day stays fixed while the end day advances, or
• a fixed-length window (e.g. the last 30 days) that slides forward a day at a time
47. Recognizing Inefficiencies
• But, frequently jobs re-compute these windows daily
• From one day to the next, the input changes very little
• A fixed-start window includes just one new day
49. Improving Fixed-Start Computations
• Suppose we must compute page view counts per member
• The job consumes all days of available input, producing one
output.
• We call this a partition-collapsing job.
• But, if the job runs tomorrow it has
to reprocess the same data.
50. Improving Fixed-Start Computations
• Solution: Merge new data with previous output
• We can do this because the aggregation is arithmetic: the new
day's counts simply add to the previous totals
• Hourglass provides a
partition-collapsing job that
supports output reuse.
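• Hourglass performs this merge in Java MapReduce; purely to illustrate the arithmetic, a Pig-style sketch with hypothetical paths and schemas:
previous = LOAD 'output/2013/10/08' AS (memberId:int, count:long);
new_day = LOAD '/data/page_view/daily/2013/10/09'
    AS (memberId:int, time:long, url:chararray);
new_counts = FOREACH (GROUP new_day BY memberId)
    GENERATE group AS memberId, COUNT(new_day) AS count;
merged = FOREACH (COGROUP previous BY memberId, new_counts BY memberId)
    GENERATE group AS memberId,
        (IsEmpty(previous) ? 0L : SUM(previous.count)) +
        (IsEmpty(new_counts) ? 0L : SUM(new_counts.count)) AS count;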
52. Improving Fixed-Length Computations
• For a fixed-length job, we can reuse the previous output using a
similar trick, sketched below:
• Add the new day to the previous output
• Subtract the old day from the result
• We can subtract the old day because the aggregation is arithmetic
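• Again purely illustrative, assuming hypothetical relations holding per-member counts for the previous window output, the newest day, and the oldest day dropping out of the window:
updated = FOREACH (COGROUP previous BY memberId, newest BY memberId,
                   oldest BY memberId)
    GENERATE group AS memberId,
        (IsEmpty(previous) ? 0L : SUM(previous.count))
        + (IsEmpty(newest) ? 0L : SUM(newest.count))
        - (IsEmpty(oldest) ? 0L : SUM(oldest.count)) AS count;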
54. Improving Fixed-Length Computations
• But, for some operations, old data cannot be subtracted out
• examples: max(), min()
• We cannot reuse the previous output, so how do we reduce computation?
• Solution: a partition-preserving job
• partitioned input data,
• partitioned output data
• the data is aggregated in advance
56. MapReduce in Hourglass
• MapReduce is a fairly general programming model
• Hourglass requires:
• reduce() must output a (key,value) pair
• reduce() must produce at most one value per key
• reduce() is implemented by an accumulator
• Hourglass provides all the MapReduce boilerplate for
you for these types of jobs
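• A rough sketch of what an accumulator-style reducer looks like, following the patterns in the Hourglass examples (treat class, method, and field names as approximate, and OUTPUT_SCHEMA as an Avro schema assumed to be defined elsewhere):
public class CountAccumulator
    implements Accumulator<GenericRecord, GenericRecord>
{
    private long count;

    @Override
    public void accumulate(GenericRecord value) {
        count += (Long)value.get("count"); // sum partial counts
    }

    @Override
    public GenericRecord getFinal() {
        GenericRecord output = new GenericData.Record(OUTPUT_SCHEMA);
        output.put("count", count);
        return output; // at most one value per key
    }

    @Override
    public void cleanup() {
        count = 0L; // reset state between keys
    }
}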
57. Summary
• Two types of jobs:
• Partition-preserving: consume partitioned
input data, produce partitioned output data
• Partition-collapsing: consume partitioned
input data, produce single output
58. Summary
• You provide:
• Input: time range, input paths
• Implement: map(), accumulate()
• Optional: merge(), unmerge()
• Hourglass provides the rest, making it easier to
implement jobs that incrementally process new data