This document provides an overview of Hadoop and related Apache projects. It begins with an introduction to Hadoop, explaining why it was created and who uses it. It then discusses HDFS and its goals, architecture, and functions. Next, it covers MapReduce, providing examples and explaining features like locality optimizations. Finally, it briefly introduces related subprojects like Pig, HBase, Hive, and Zookeeper that build upon Hadoop.
Lecture17 (1).ppt
1. UT DALLAS Erik Jonsson School of Engineering & Computer Science
FEARLESS engineering
Cloud Tools Overview
2. UT DALLAS Erik Jonsson School of Engineering & Computer Science
FEARLESS engineering
Hadoop
3. FEARLESS engineering
Outline
• Hadoop - Basics
• HDFS
– Goals
– Architecture
– Other functions
• MapReduce
– Basics
– Word Count Example
– Handy tools
– Finding shortest path example
• Related Apache sub-projects (Pig, HBase, Hive)
4. FEARLESS engineering
Hadoop - Why?
• Need to process huge datasets on large
clusters of computers
• Very expensive to build reliability into each
application
• Nodes fail every day
– Failure is expected, rather than exceptional
– The number of nodes in a cluster is not constant
• Need a common infrastructure
– Efficient, reliable, easy to use
– Open Source, Apache Licence
5. FEARLESS engineering
Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• New York Times
• Veoh
• Yahoo!
• …. many more
6. FEARLESS engineering
Commodity Hardware
• Typically a 2-level architecture
– Nodes are commodity PCs
– 30-40 nodes/rack
– Uplink from rack is 3-4 gigabit
– Rack-internal is 1 gigabit
(Diagram: nodes attach to a rack switch; rack switches uplink to an aggregation switch)
7. UT DALLAS Erik Jonsson School of Engineering & Computer Science
FEARLESS engineering
Hadoop Distributed File System
(HDFS)
Original Slides by
Dhruba Borthakur
Apache Hadoop Project Management Committee
8. FEARLESS engineering
Goals of HDFS
• Very Large Distributed File System
– 10K nodes, 100 million files, 10PB
• Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detect failures and recover from them
• Optimized for Batch Processing
– Data locations exposed so that computations can
move to where data resides
– Provides very high aggregate bandwidth
9. FEARLESS engineering
Distributed File System
• Single Namespace for entire cluster
• Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
• Files are broken up into blocks
– Typically 64MB block size
– Each block replicated on multiple DataNodes
• Intelligent Client
– Client can find location of blocks
– Client accesses data directly from DataNode
11. FEARLESS engineering
Functions of a NameNode
• Manages File System Namespace
– Maps a file name to a set of blocks
– Maps a block to the DataNodes where it resides
• Cluster Configuration Management
• Replication Engine for Blocks
12. FEARLESS engineering
NameNode Metadata
• Metadata in Memory
– The entire metadata is in main memory
– No demand paging of metadata
• Types of metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A Transaction Log
– Records file creations, file deletions, etc.
13. FEARLESS engineering
DataNode
• A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores metadata of a block (e.g. CRC)
– Serves data and metadata to Clients
• Block Report
– Periodically sends a report of all existing blocks to
the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
14. FEARLESS engineering
Block Placement
• Current Strategy
– One replica on local node
– Second replica on a remote rack
– Third replica on same remote rack
– Additional replicas are randomly placed
• Clients read from nearest replicas
• Would like to make this policy pluggable
16. FEARLESS engineering
Replication Engine
• NameNode detects DataNode failures
– Chooses new DataNodes for new replicas
– Balances disk usage
– Balances communication traffic to DataNodes
17. FEARLESS engineering
Data Correctness
• Use Checksums to validate data
– Use CRC32
• File Creation
– Client computes checksum per 512 bytes
– DataNode stores the checksum
• File access
– Client retrieves the data and checksum from
DataNode
– If Validation fails, Client tries other replicas
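Not from the slides, but the chunk-level checksumming is easy to picture with Java's built-in java.util.zip.CRC32; this illustrative sketch computes one CRC32 per 512-byte chunk of a buffer, mirroring what the client does at file creation:

import java.util.zip.CRC32;

public class ChunkChecksums {
  private static final int CHUNK = 512;

  // One CRC32 value per 512-byte chunk; the DataNode stores these
  // alongside the block data and a reader re-verifies them on access.
  public static long[] checksums(byte[] data) {
    int n = (data.length + CHUNK - 1) / CHUNK;
    long[] sums = new long[n];
    CRC32 crc = new CRC32();
    for (int i = 0; i < n; i++) {
      crc.reset();
      int off = i * CHUNK;
      crc.update(data, off, Math.min(CHUNK, data.length - off));
      sums[i] = crc.getValue();
    }
    return sums;
  }
}

On read, the client recomputes the same values and falls back to another replica on any mismatch, as described above.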
18. FEARLESS engineering
NameNode Failure
• A single point of failure
• Transaction Log stored in multiple directories
– A directory on the local file system
– A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution
19. FEARLESS engineering
Data Pipelining
• Client retrieves a list of DataNodes on which
to place replicas of a block
• Client writes block to the first DataNode
• The first DataNode forwards the data to the
next node in the Pipeline
• When all replicas are written, the Client
moves on to write the next block in the file
20. FEARLESS engineering
Rebalancer
• Goal: % disk full on DataNodes should be
similar
– Usually run when new DataNodes are added
– Cluster is online when Rebalancer is active
– Rebalancer is throttled to avoid network
congestion
– Command line tool
21. FEARLESS engineering
Secondary NameNode
• Copies FsImage and Transaction Log from
NameNode to a temporary directory
• Merges FsImage and Transaction Log into a
new FsImage in the temporary directory
• Uploads new FsImage to the NameNode
– Transaction Log on NameNode is purged
23. UT DALLAS Erik Jonsson School of Engineering & Computer Science
FEARLESS engineering
MapReduce
Original Slides by
Owen O’Malley (Yahoo!)
&
Christophe Bisciglia, Aaron Kimball & Sierra Michels-Slettvet
24. FEARLESS engineering
MapReduce - What?
• MapReduce is a programming model for
efficient distributed computing
• It works like a Unix pipeline
– cat input | grep | sort | uniq -c | cat > output
– Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
– Streaming through data, reducing seeks
– Pipelining
• A good fit for a lot of applications
– Log processing
– Web index building
26. FEARLESS engineering
MapReduce - Features
• Fine grained Map and Reduce tasks
– Improved load balancing
– Faster recovery from failed tasks
• Automatic re-execution on failure
– In a large cluster, some nodes are always slow or flaky
– Framework re-executes failed tasks
• Locality optimizations
– With large data, bandwidth to data is a problem
– Map-Reduce + HDFS is a very effective solution
– Map-Reduce queries HDFS for locations of input data
– Map tasks are scheduled close to the inputs when
possible
27. FEARLESS engineering
Word Count Example
• Mapper
– Input: value: lines of text of input
– Output: key: word, value: 1
• Reducer
– Input: key: word, value: set of counts
– Output: key: word, value: sum
• Launching program
– Defines this job
– Submits job to cluster
29. FEARLESS engineering
Word Count Mapper
public static class Map extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one); // emit (word, 1) for every token
    }
  }
}
30. FEARLESS engineering
Word Count Reducer
public static class Reduce extends MapReduceBase implements
    Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get(); // accumulate the counts for this word
    }
    output.collect(key, new IntWritable(sum));
  }
}
31. FEARLESS engineering
Word Count Example
• Jobs are controlled by configuring JobConfs
• JobConfs are maps from attribute names to string values
• The framework defines attributes to control how the job is
executed
– conf.set("mapred.job.name", "MyApp");
• Applications can add arbitrary values to the JobConf
– conf.set("my.string", "foo");
– conf.setInt("my.integer", 12);
• JobConf is available to all tasks
32. FEARLESS engineering
Putting it all together
• Create a launching program for your application
• The launching program configures:
– The Mapper and Reducer to use
– The output key and value types (input types are
inferred from the InputFormat)
– The locations for your input and output
• The launching program then submits the job and
typically waits for it to complete
33. FEARLESS engineering
Putting it all together
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
34. FEARLESS engineering
Input and Output Formats
• A Map/Reduce may specify how it’s input is to be read
by specifying an InputFormat to be used
• A Map/Reduce may specify how it’s output is to be
written by specifying an OutputFormat to be used
• These default to TextInputFormat and
TextOutputFormat, which process line-based text data
• Another common choice is SequenceFileInputFormat
and SequenceFileOutputFormat for binary data
• These are file-based, but they are not required to be
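As a small sketch of the point above (reusing the JobConf from the word count driver), switching a job from line-based text to binary SequenceFiles is a two-line change under the old mapred API:

conf.setInputFormat(SequenceFileInputFormat.class);   // read binary key/value pairs
conf.setOutputFormat(SequenceFileOutputFormat.class); // write binary key/value pairs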
35. FEARLESS engineering
How many Maps and Reduces
• Maps
– Usually as many as the number of HDFS blocks being
processed, this is the default
– Else the number of maps can be specified as a hint
– The number of maps can also be controlled by specifying the
minimum split size
– The actual sizes of the map inputs are computed by:
• max(min(block_size, data/#maps), min_split_size)
• Reduces
– Unless the amount of data being processed is small
• 0.95*num_nodes*mapred.tasktracker.tasks.maximum
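A minimal sketch of setting these counts on the JobConf; the cluster numbers below are made-up placeholders, not values Hadoop provides:

int numNodes = 40;        // hypothetical cluster size
int tasksPerTracker = 2;  // hypothetical mapred.tasktracker.tasks.maximum
conf.setNumMapTasks(200); // a hint only; the input splits decide the real count
conf.setNumReduceTasks((int) (0.95 * numNodes * tasksPerTracker));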
37. FEARLESS engineering
Partitioners
• Partitioners are application code that define how keys
are assigned to reduces
• Default partitioning spreads keys evenly, but randomly
– Uses key.hashCode() % num_reduces
• Custom partitioning is often required, for example, to
produce a total order in the output
– Should implement Partitioner interface
– Set by calling conf.setPartitionerClass(MyPart.class)
– To get a total order, sample the map output keys and pick
values to divide the keys into roughly equal buckets and use
that in your partitioner
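A minimal custom-partitioner sketch under the old mapred API; the single split point "m" is a made-up example standing in for a boundary you would pick by sampling the map output keys:

public static class RangePartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) {}

  // Keys below "m" go to the first reduce, the rest to the last one, so
  // concatenating the reduce outputs in order gives a totally ordered result.
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return key.toString().compareTo("m") < 0 ? 0 : numReduceTasks - 1;
  }
}

It is registered on the job with conf.setPartitionerClass(RangePartitioner.class), as noted above.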
38. FEARLESS engineering
Combiners
• When maps produce many repeated keys
– It is often useful to do a local aggregation following the map
– Done by specifying a Combiner
– Goal is to decrease size of the transient data
– Combiners have the same interface as Reduces, and often are the
same class
– Combiners must not have side effects, because they may run an
indeterminate number of times
– In WordCount, conf.setCombinerClass(Reduce.class);
39. FEARLESS engineering
Compression
• Compressing the outputs and intermediate data will often yield
huge performance gains
– Can be specified via a configuration file or set programmatically
– Set mapred.output.compress to true to compress job output
– Set mapred.compress.map.output to true to compress map outputs
• Compression Types (mapred(.map)?.output.compression.type)
– “block” - Group of keys and values are compressed together
– “record” - Each value is compressed individually
– Block compression is almost always best
• Compression Codecs
(mapred(.map)?.output.compression.codec)
– Default (zlib) - slower, but more compression
– LZO - faster, but less compression
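A sketch of setting these properties programmatically, using the old property names quoted above (which codecs are available, e.g. LZO, depends on the cluster):

conf.setBoolean("mapred.output.compress", true);     // compress job output
conf.setBoolean("mapred.compress.map.output", true); // compress map outputs
conf.set("mapred.output.compression.type", "BLOCK"); // block compression
conf.set("mapred.output.compression.codec",
    "org.apache.hadoop.io.compress.DefaultCodec");   // zlib default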
40. FEARLESS engineering
Counters
• Often Map/Reduce applications have countable events
• For example, the framework counts records into and out
of the Mapper and Reducer
• To define user counters:
static enum Counter {EVENT1, EVENT2};
reporter.incrCounter(Counter.EVENT1, 1);
• Define nice names in a MyClass_Counter.properties
file
CounterGroupName=MyCounters
EVENT1.name=Event 1
EVENT2.name=Event 2
41. FEARLESS engineering
Speculative execution
• The framework can run multiple instances of slow tasks
– Output from the instance that finishes first is used
– Controlled by the configuration variable mapred.speculative.execution
– Can dramatically shorten the long tail of job completion times
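As a one-line sketch on the job configuration:
conf.setBoolean("mapred.speculative.execution", true); // enable speculative tasks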
42. FEARLESS engineering
Zero Reduces
• Frequently, we only need to run a filter on the input
data
– No sorting or shuffling required by the job
– Set the number of reduces to 0
– Output from maps will go directly to OutputFormat and disk
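In code this is a single setting on the JobConf:
conf.setNumReduceTasks(0); // map-only job; no sort or shuffle happens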
43. FEARLESS engineering
Distributed File Cache
• Tasks sometimes need read-only copies of data on the local computer
– Downloading 1GB of data for each Mapper is expensive
• Define the list of files you need to download in the JobConf
• Files are downloaded once per computer
• Add to launching program:
DistributedCache.addCacheFile(new URI("hdfs://nn:8020/foo"), conf);
• Add to task:
Path[] files = DistributedCache.getLocalCacheFiles(conf);
44. FEARLESS engineering
Tool
• Handles “standard” Hadoop command line options
– -conf file - load a configuration file named file
– -D prop=value - define a single configuration property prop
• Class looks like:
public class MyApp extends Configured implements Tool {
  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyApp(), args));
  }
  public int run(String[] args) throws Exception {
    // ... use getConf() to access the parsed configuration ...
    return 0;
  }
}
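Launched through ToolRunner, the standard options are stripped before run() sees args; the jar and path names below are illustrative:
hadoop jar myapp.jar MyApp -D mapred.reduce.tasks=10 input output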
45. FEARLESS engineering
Finding the Shortest Path
• A common graph search
application is finding the
shortest path from a start
node to one or more
target nodes
• Commonly done on a
single machine with
Dijkstra’s Algorithm
• Can we use BFS to find
the shortest path via
MapReduce?
46. FEARLESS engineering
Finding the Shortest Path: Intuition
• We can define the solution to this problem
inductively
– DistanceTo(startNode) = 0
– For all nodes n directly reachable from startNode,
DistanceTo(n) = 1
– For all nodes n reachable from some other set of nodes S,
DistanceTo(n) = 1 + min{DistanceTo(m) : m ∈ S}
47. FEARLESS engineering
From Intuition to Algorithm
• A map task receives a node n as a key, and
(D, points-to) as its value
– D is the distance to the node from the start
– points-to is a list of nodes reachable from n
– ∀ p ∈ points-to, emit (p, D+1)
• A reduce task gathers the possible distances to a given p and selects the minimum one
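A hedged sketch of one iteration in the old mapred API; the class names, the Text/Text input (as KeyValueTextInputFormat would provide), and the record layout (distance, a tab, then a comma-separated points-to list) are assumptions for illustration, not from the original slides:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Input record per node: key = node id, value = "D<TAB>p1,p2,..."
class BfsMapper extends MapReduceBase implements Mapper<Text, Text, Text, Text> {
  public void map(Text node, Text value, OutputCollector<Text, Text> out,
                  Reporter reporter) throws IOException {
    String[] parts = value.toString().split("\t", 2);
    int d = Integer.parseInt(parts[0]); // unreached nodes carry a large sentinel
    out.collect(node, value); // re-emit the whole record so points-to survives
    if (d != Integer.MAX_VALUE && parts.length > 1 && !parts[1].isEmpty()) {
      for (String p : parts[1].split(",")) {
        out.collect(new Text(p), new Text(Integer.toString(d + 1))); // candidate D+1
      }
    }
  }
}

class BfsReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text node, Iterator<Text> values,
                     OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    int best = Integer.MAX_VALUE;
    String pointsTo = "";
    while (values.hasNext()) {
      String[] parts = values.next().toString().split("\t", 2);
      best = Math.min(best, Integer.parseInt(parts[0])); // minimum distance wins
      if (parts.length > 1) pointsTo = parts[1]; // recover the adjacency list
    }
    out.collect(node, new Text(best + "\t" + pointsTo));
  }
}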
48. FEARLESS engineering
What This Gives Us
• This MapReduce task can advance the known
frontier by one hop
• To perform the whole BFS, a non-MapReduce
component then feeds the output of this step
back into the MapReduce task for another
iteration
– Problem: Where’d the points-to list go?
– Solution: Mapper emits (n, points-to) as well
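A sketch of that outer driving component; buildBfsJob and the UPDATED counter are hypothetical stand-ins for job setup and a count of improved distances:
String input = "graph/iter0";
for (int i = 1; ; i++) {
  JobConf conf = buildBfsJob(input, "graph/iter" + i); // hypothetical job setup
  RunningJob job = JobClient.runJob(conf); // runs one BFS hop to completion
  if (job.getCounters().getCounter(BfsCounter.UPDATED) == 0) break; // no change: done
  input = "graph/iter" + i;
}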
49. FEARLESS engineering
Blow-up and Termination
• This algorithm starts from one node
• Subsequent iterations include many more
nodes of the graph as the frontier advances
• Does this ever terminate?
– Yes! Eventually no new routes between nodes will be discovered and no better distances will be found; when the distances stop changing, we stop
– The Mapper should also emit (n, D) to ensure that the “current distance” is carried into the reducer
50. UT DALLAS Erik Jonsson School of Engineering & Computer Science
FEARLESS engineering
Hadoop Subprojects
51. FEARLESS engineering
Hadoop Related Subprojects
• Pig
– High-level language for data analysis
• HBase
– Table storage for semi-structured data
• Zookeeper
– Coordinating distributed applications
• Hive
– SQL-like query language and metastore
• Mahout
– Machine learning
52. UT DALLAS Erik Jonsson School of Engineering & Computer Science
FEARLESS engineering
Pig
Original Slides by
Matei Zaharia
UC Berkeley RAD Lab
53. FEARLESS engineering
Pig
• Started at Yahoo! Research
• Now runs about 30% of Yahoo!’s jobs
• Features
– Expresses sequences of MapReduce jobs
– Data model: nested “bags” of items
– Provides relational (SQL) operators
(JOIN, GROUP BY, etc.)
– Easy to plug in Java functions
54. FEARLESS engineering
An Example Problem
• Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25
[Dataflow diagram: Load Users and Load Pages → Filter by age → Join on name → Group on url → Count clicks → Order by clicks → Take top 5]
56. FEARLESS engineering
In Pig Latin
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, count(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
57. FEARLESS engineering
Ease of Translation
[Diagram: each step of the dataflow (Load Users, Load Pages, Filter by age, Join on name, Group on url, Count clicks, Order by clicks, Take top 5) pairs one-to-one with a Pig Latin statement from the script above]
58. FEARLESS engineering
Ease of Translation
[Diagram: the same dataflow with the statements grouped into the MapReduce jobs Pig generates: Job 1 covers load/filter/join, Job 2 covers group/count, Job 3 covers order/limit]
59. UT DALLAS Erik Jonsson School of Engineering & Computer Science
FEARLESS engineering
HBase
Original Slides by
Tom White
Lexeme Ltd.
60. FEARLESS engineering
HBase - What?
• Modeled on Google’s Bigtable
• Row/column store
• Billions of rows / millions of columns
• Column-oriented - nulls are free
• Untyped - stores byte[]
61. FEARLESS engineering
HBase - Data Model
Row         Timestamp   Column family animal:        Column family repairs:
                        animal:type   animal:size    repairs:cost
enclosure1  t2          zebra                        1000 EUR
            t1          lion          big
enclosure2  …           …             …              …
62. FEARLESS engineering
HBase - Data Storage
Column family animal:
(enclosure1, t2, animal:type) zebra
(enclosure1, t1, animal:size) big
(enclosure1, t1, animal:type) lion
Column family repairs:
(enclosure1, t1, repairs:cost) 1000 EUR
63. FEARLESS engineering
HBase - Code
HTable table = …
Text row = new Text("enclosure1");
Text col1 = new Text("animal:type");
Text col2 = new Text("animal:size");
BatchUpdate update = new BatchUpdate(row);
update.put(col1, "lion".getBytes("UTF-8"));
update.put(col2, "big".getBytes("UTF-8"));
table.commit(update);
update = new BatchUpdate(row);
update.put(col1, "zebra".getBytes("UTF-8"));
table.commit(update);
64. FEARLESS engineering
HBase - Querying
• Retrieve a cell
Cell cell = table.getRow("enclosure1").getColumn("animal:type").getValue();
• Retrieve a row
RowResult result = table.getRow("enclosure1");
• Scan through a range of rows
Scanner s = table.getScanner(new String[] { "animal:type" });
65. UT DALLAS Erik Jonsson School of Engineering & Computer Science
FEARLESS engineering
Hive
Original Slides by
Matei Zaharia
UC Berkeley RAD Lab
66. FEARLESS engineering
Hive
• Developed at Facebook
• Used for majority of Facebook jobs
• “Relational database” built on Hadoop
– Maintains list of table schemas
– SQL-like query language (HiveQL)
– Can call Hadoop Streaming scripts from HiveQL
– Supports table partitioning, clustering, complex
data types, some optimizations
67. FEARLESS engineering
Creating a Hive Table
• Partitioning breaks table into separate files for
each (dt, country) pair
Ex: /hive/page_view/dt=2008-06-08,country=USA
/hive/page_view/dt=2008-06-08,country=CA
CREATE TABLE page_views(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
68. FEARLESS engineering
A Simple Query
• Find all page views coming from xyz.com during March 2008:
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';
• Hive reads only the matching dt partitions (2008-03-01 through 2008-03-31) instead of scanning the entire table
69. FEARLESS engineering
Aggregation and Joins
• Count users who visited each page, by gender:
SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;
70. FEARLESS engineering
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;
71. UT DALLAS Erik Jonsson School of Engineering & Computer Science
FEARLESS engineering
Storm
Original Slides by
Nathan Marz
Twitter
72. FEARLESS engineering
Storm
• Developed by BackType which was acquired
by Twitter
• Lots of tools exist for batch data processing
– Hadoop, Pig, HBase, Hive, …
• None of them are realtime systems, and realtime processing is becoming a real requirement for businesses
• Storm provides realtime computation
– Scalable
– Guarantees no data loss
– Extremely robust and fault-tolerant
– Programming language agnostic
85. FEARLESS engineering
Stream Grouping
• Shuffle grouping: pick a random task
• Fields grouping: consistent hashing on a
subset of tuple fields
• All grouping: send to all tasks
• Global grouping: pick task with lowest id
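A sketch of how these groupings are declared with Storm's TopologyBuilder; the spout and bolt classes here are hypothetical:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("words", new WordSpout(), 2); // hypothetical spout
// fields grouping: tuples with the same "word" always reach the same task
builder.setBolt("count", new WordCountBolt(), 4)
       .fieldsGrouping("words", new Fields("word"));
// shuffle grouping: tuples go to a random task of the downstream bolt
builder.setBolt("report", new ReportBolt(), 1).shuffleGrouping("count");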