Abhishek Roy will teach a master class on Big Data and Hadoop. The class will cover what Big Data is, the history and background of Hadoop, how to set up and use Hadoop, and tools like HDFS, MapReduce, Pig, Hive, Mahout, Sqoop, Flume, Hue, Zookeeper and Impala. The class will also discuss real world use cases and the growing market for Big Data tools and skills.
2. About Me
• Abhishek Roy
• Around 10 years of tech experience with services, products and startups
• Was leading the data team at Qyuki Digital Media
• Lately involved in Big Data, with emphasis on recommendation systems and machine learning approaches
• Currently working on building an employee wellness platform, www.feetapart.com
• LinkedIn: http://www.linkedin.com/in/abhishekroy8
3. Agenda
• What is Big Data
• History and Background
• Why Data
• Intro to Hadoop
• Intro to HDFS
• Setup & some hands on
• HDFS architecture
• Intro to MapReduce
4. Agenda
• Pig
• Hive
• Apache Mahout
– Running an example MR Job
• Sqoop, Flume, Hue, Zookeeper
• Impala
• Some Real World Use cases
5. What Is Big Data?
Like the term Cloud, it is a bit ill-defined and hazy
7. What’s the “BIG” hype about Big Data?
There may be hype, but the problems are real and of big value. How?
• We are in the age of advanced analytics, where valuable business insight is mined out of historical data; that is exactly where the problem lies, because we want to analyze that data.
• We live in the age of crazy data, where every individual, enterprise and machine leaves so much data behind, summing up to many terabytes and petabytes, and it is only expected to grow.
8. What’s the “BIG” hype about Big Data?
• Good news, a blessing in disguise: more data means more precision.
• More data usually beats better algorithms.
• But how are we going to analyze it?
• Traditional database or warehouse systems crawl or crack at these volumes.
• They are inflexible when handling most of these formats.
• This is the very characteristic of Big Data.
9. Key Hadoop/Big Data Data Sources
• Sentiment
• Clickstream
• Sensor/Machine
• Geographic
• Text
• Chatter from social networks
• Traffic flow sensors
• Satellite imagery
• Broadcast audio streams
10. Sources of Big Data
• Banking transactions
• MP3s of rock music
• Scans of government documents
• GPS trails
• Telemetry from automobiles
• Financial market data
• ….
11. Key Drivers
Spread of cloud computing, mobile computing and social media technologies, and financial transactions
12. Introduction cont.
• Nature of Big Data
  – Huge volumes of data that cannot be handled by traditional database or warehouse systems.
  – It is mostly machine-produced.
  – Mostly unstructured and grows at high velocity.
  – Big data doesn’t always mean huge data; it means “difficult” data.
16. Inflection Points
• Is “divide the data and rule” a solution here?
  – Have multiple disk drives, split your data file into small enough pieces across the drives, and do parallel reads and processing.
  – Hardware reliability (failure of any drive) is a challenge.
  – Resolving data interdependency between drives is a notorious challenge.
  – The number of disk drives that can be added to a server is limited.
17. Inflection Points
• Analysis
  – Much of big data is unstructured; traditional RDBMS/EDW cannot handle it.
  – Lots of Big Data analysis is ad hoc in nature, involving whole-data scans, self-references, joins, combining, etc.
  – Traditional RDBMS/EDW cannot handle these with their limited scalability options and architectural limitations.
  – You can incorporate better servers and processors and throw in more RAM, but there is a limit to it.
18. Inflection Points
• We need a drastically different approach
  – A distributed file system with high capacity and high reliability.
  – A processing engine that can handle structured/unstructured data.
  – A computation model that can operate on distributed data and abstracts away data dispersion.
20. What is Hadoop?
• “Framework for running [distributed] applications on large clusters built of commodity hardware” – from the Hadoop Wiki
• Originally created by Doug Cutting
  – Named the project after his son’s toy
• The name “Hadoop” has now evolved to cover a family of products, but at its core it’s essentially just the MapReduce programming paradigm plus a distributed file system
24. Pig
A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
25. Mahout
A machine learning library with algorithms for clustering, classification and batch-based collaborative filtering that are implemented on top of Apache Hadoop.
26. Hive
Data warehouse software built on top of Apache Hadoop that facilitates querying and managing large datasets residing in distributed storage.
27. Sqoop
A tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
28. Apache Flume
A distributed service for collecting, aggregating, and moving large amounts of log data to HDFS.
29. Twitter Storm
Storm can be used to process a stream of new data and update databases in real time.
30. Funding & IPO
• Cloudera (commercial Hadoop): more than $75 million
• MapR (Cloudera competitor) has raised more than $25 million
• 10gen (maker of MongoDB): $32 million
• DataStax (products based on Apache Cassandra): $11 million
• Splunk raised about $230 million through its IPO
32. Big Data Application Domains
• Healthcare
• The public sector
• Retail
• Manufacturing
• Personal-location data
• Finance
33. Use The Right Tool For The Right Job
Relational Databases: when to use?
• Interactive Reporting (<1 sec)
• Multistep Transactions
• Lots of Inserts/Updates/Deletes
Hadoop: when to use?
• Affordable Storage/Compute
• Structured or Not (Agility)
• Resilient Auto Scalability
34. Ship the Function to the Data
[Diagram] Traditional architecture: the data is pulled from centralized storage (RDBMS, SAN/NAS) across the network to where the functions run. Distributed computing: the functions are shipped to the nodes where the data already resides.
35. Economics of Hadoop Storage
• Typical hardware:
  – Two quad-core Nehalems
  – 24 GB RAM
  – 12 × 1 TB SATA disks (JBOD mode, no need for RAID)
  – 1 Gigabit Ethernet card
• Cost/node: $5K/node
• Effective HDFS space:
  – ¼ reserved for temp shuffle space, which leaves 9 TB/node
  – 3-way replication leads to 3 TB effective HDFS space/node
  – But assuming 7× compression that becomes ~20 TB/node
• Effective cost per user TB: $250/TB
• Other solutions cost in the range of $5K to $100K per user TB
39. Market for big data tools will rise from $9 billion to $86 billion in 2020
40. Future of Big Data
• More powerful and expressive tools for analysis
• Streaming data processing (Storm from Twitter and S4 from Yahoo)
• Rise of data marketplaces (InfoChimps, Azure Marketplace)
• Development of data science workflows and tools (Chorus, The Guardian, New York Times)
• Increased understanding of analysis and visualization
http://www.evolven.com/blog/big-data-predictions.html
44. Market for big data tools will rise from $9 billion to $86 billion in 2020
45. Typical Hadoop Architecture
[Architecture diagram. Components shown: Data Collection feeding Hadoop (storage and batch processing), used by engineers; an OLAP data mart feeding business intelligence for business users; an OLTP data store backing an interactive application for end customers.]
46. HDFS Introduction
• Written in Java
• Optimized for larger files
  – Focus on streaming data (high throughput over low latency)
• Rack-aware
• Only *nix for production environments
• Web consoles for stats
49. MapReduce basics
• Take a large problem and divide it into sub-problems
• Perform the same function on all sub-problems
• Combine the output from all sub-problems
50. MapReduce (M/R) facts
• M/R is excellent for problems where the sub-problems are not interdependent
• For example, the output of one mapper should not depend on the output of, or communication with, another mapper
• The reduce phase does not begin execution until all mappers have finished
• Failed map and reduce tasks get auto-restarted
• Rack/HDFS-aware
52. What is the MapReduce Model
• MapReduce is a computation model that supports parallel processing on distributed data using clusters of computers.
• The MapReduce model expects the input data to be split and distributed to the machines in the cluster so that each split can be processed independently and in parallel.
• There are two stages of processing in the MapReduce model to achieve the final result: Map and Reduce. Every machine in the cluster can run independent map and reduce processes.
53. What is the MapReduce Model
• The Map phase processes the input splits. The output of the Map phase is distributed again to reduce processes, which combine the map output to give the final expected result.
• The model treats data at every stage as a key and value pair, transforming one set of key/value pairs into a different set of key/value pairs to arrive at the end result.
• The Map process transforms input key/value pairs into a set of intermediate key/value pairs.
• The MapReduce framework passes this output to reduce processes, which transform it to get the final result, which again will be in the form of key/value pairs.
54. Design of MapReduce – Daemons
The MapReduce system is managed by two daemons
• JobTracker & TaskTracker
• JobTracker and TaskTracker function in master/slave fashion
  – JobTracker coordinates the entire job execution
  – TaskTracker runs the individual map and reduce tasks
  – JobTracker does the bookkeeping of all the tasks run on the cluster
  – One map task is created for each input split
  – Number of reduce tasks is configurable (mapred.reduce.tasks)
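The reduce-task count can also be set from the job driver. Below is a minimal sketch (not from the deck); the class name ReduceTaskCountExample and the value 4 are illustrative, and the Job constructor is the same Hadoop 1.x API used by the word-count driver later in the deck.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceTaskCountExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Same Job constructor style as the word-count driver on slide 113.
    Job job = new Job(conf, "Word Count");
    // Equivalent to setting mapred.reduce.tasks for this job; 4 is illustrative.
    job.setNumReduceTasks(4);
    System.out.println("Configured reduce tasks: " + job.getNumReduceTasks());
  }
}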
56. Who Loves it
• Yahoo! runs 20,000 servers running Hadoop
• Its largest Hadoop cluster is 4,000 servers with 16 PB raw storage
• Facebook runs 2,000 Hadoop servers
  – 24 PB raw storage and 100 TB of raw logs/day
• eBay and LinkedIn have production uses of Hadoop
• Sears Retail uses Hadoop
58. Hadoop Requirements
• Supported platforms
  – GNU/Linux is supported for development and production
• Required software
  – Java 1.6.x+
  – ssh installed and sshd running (for machines in the cluster to interact with the master machines)
• Development environment
  – Eclipse 3.6 or above
59. Lab Requirements
• Windows 7 64-bit OS, min 4 GB RAM
• VMware Player 5.0.0
• Linux VM: Ubuntu 12.04 LTS
  – User: hadoop, password: hadoop123
• Java 6 installed on the Linux VM
• OpenSSH installed on the Linux VM
• PuTTY, for opening Telnet sessions to the Linux VM
• WinSCP, for transferring files between Windows and the Linux VM
• Other Linux machines will do as well
• Eclipse 3.6
62. Starting VM
• Enter userID/password
• Type ifconfig
  – Note down the IP address
  – Connect to the VM using PuTTY
63. Install and Configure ssh (non-VM users)
• Install ssh
  sudo apt-get install ssh
• Check the ssh installation:
  which ssh
  which sshd
  which ssh-keygen
• Generate an ssh key
  ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
• Copy the public key as an authorized key (equivalent to a slave node)
  cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
  chmod 700 ~/.ssh
  chmod 600 ~/.ssh/authorized_keys
64. Install and Configure ssh
• Verify SSH by logging into the target (localhost here)
  – Command: ssh localhost
  – (this command will log you into the machine localhost)
65. Accessing the VM with PuTTY and WinSCP
• Get the IP address of the VM by using the ifconfig command in the Linux VM
• Use PuTTY to telnet to the VM
• Use WinSCP to FTP files to the VM
67. Lab-VM Directory Structure (non-VM users)
• User home directory for user “hadoop” (created by default by the OS)
  – /home/hadoop
• Create a working directory for the lab session
  – /home/hadoop/lab
• Create a directory for storing all downloads (installables)
  – /home/hadoop/lab/downloads
• Create a directory for storing data for analysis
  – /home/hadoop/lab/data
• Create a directory for installing tools
  – /home/hadoop/lab/install
68. Install and Configure Java (non-VM users)
• Install OpenJDK
  – Command: sudo apt-get install openjdk-6-jdk
  – Check the installation: java -version
• Configure the Java home in the environment
  – Add a line to .bash_profile to set the Java home:
    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
  – Hadoop will use it at runtime
69. Install Hadoop (non-VM users)
• Download the Hadoop tarball with wget
  – http://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/hadoop-1.0.3.tar.gz
• Untar
  – cd ~/lab/install
  – tar xzf ~/lab/downloads/hadoop-1.0.3.tar.gz
  – Check the extracted directory “hadoop-1.0.3”
• Configure the environment in .bash_profile (or .bashrc)
  – Add the two lines below and then execute the profile
    export HADOOP_INSTALL=~/lab/install/hadoop-1.0.3
    export PATH=$PATH:$HADOOP_INSTALL/bin
    . .bash_profile
• Check the Hadoop installation
  – hadoop version
70. Setting up Hadoop
• Open $HADOOP_HOME/conf/hadoop-env.sh
• Set the JAVA_HOME environment variable to the $JAVA_HOME directory:
  export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
72. Components of core Hadoop
• At a high level, Hadoop architecture components can be classified into two categories
  – Distributed file management system: HDFS
    • This has central and distributed sub-components
    • NameNode: centrally monitors and controls the whole file system
    • Secondary NameNode: do not confuse, this is not a NameNode backup; it just backs up the file system status from the NameNode periodically
    • DataNode: takes care of the local file segments and constantly communicates with the NameNode
  – Distributed computing system: MapReduce framework
    • This again has central and distributed sub-components
    • JobTracker: centrally monitors the submitted job and controls all processes running on the nodes (computers) of the cluster; it communicates with the NameNode for file system access
    • TaskTracker: takes care of local job execution on the local file segments; it talks to the DataNode for file information and constantly communicates with the JobTracker daemon to report task progress
• When the Hadoop system is running in distributed mode, all the daemons run on their respective computers.
73. Hadoop Operational Modes
Hadoop can be run in one of three modes
• Standalone (local) mode
  – No daemons launched
  – Everything runs in a single JVM
  – Suitable for development
• Pseudo-distributed mode
  – All daemons are launched
  – Runs on a single machine, thus simulating a cluster environment
  – Suitable for testing and debugging
• Fully distributed mode
  – The Hadoop daemons run in a cluster environment
  – Each daemon runs on the machine respectively assigned to it
  – Suitable for integration testing/production
74. Hadoop Configuration Files
• hadoop-env.sh (Bash script): environment variables that are used in the scripts to run Hadoop
• core-site.xml (Hadoop configuration XML): configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce
• hdfs-site.xml (Hadoop configuration XML): configuration settings for the HDFS daemons: the namenode, the secondary namenode, and the datanodes
• mapred-site.xml (Hadoop configuration XML): configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers
• masters (plain text): a list of machines (one per line) that each run a secondary namenode
• slaves (plain text): a list of machines (one per line) that each run a datanode and a tasktracker
• hadoop-metrics.properties (Java properties): properties for controlling how metrics are published in Hadoop
• log4j.properties (Java properties): properties for system logfiles, the namenode audit log, and the task log for the tasktracker child process
77. Design of HDFS
• HDFS is Hadoop's distributed file system
• Designed for storing very large files (petabytes)
• A single file can be stored across several disks
• Not suitable for low-latency data access
• Designed to be highly fault tolerant, hence it can run on commodity hardware
78. HDFS Concepts
• Like any file system, HDFS stores files by breaking them into their smallest units, called blocks.
• The default HDFS block size is 64 MB.
• The large block size helps in maintaining high throughput.
• Each block is replicated across multiple machines in the cluster for redundancy. (For example, a 1 GB file becomes sixteen 64 MB blocks, and with the default replication factor of 3 the cluster stores 48 block replicas.)
79. Design of HDFS – Daemons
• The HDFS file system is managed by two daemons
• NameNode and DataNode
• NameNode and DataNode function in master/slave fashion
  – The NameNode manages the file system namespace
  – It maintains the file system and the metadata of all the files and directories
    • Namespace image
    • Edit log
80. Design of HDFS – Daemons (cont.)
• DataNodes store and retrieve the blocks for the files when they are told to by the NameNode.
• The NameNode maintains the information on which DataNodes all the blocks for a given file are located.
• DataNodes report to the NameNode periodically with the list of blocks they are storing.
• With the NameNode down, the HDFS is inaccessible.
• Secondary NameNode
  – Not a backup for the NameNode
  – Just helps in merging the namespace image with the edit log, to avoid the edit log becoming too large
86. Configuring conf/*-site.xml files
• Need to set the hadoop.tmp.dir parameter to a directory of your choice.
• We will use /app/hadoop/tmp
  sudo mkdir -p /app/hadoop/tmp
  sudo chmod 777 /app/hadoop/tmp
87. Configuring HDFS: core-site.xml (Pseudo-Distributed Mode)
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>
Note: Add the "fs.default.name" property under the configuration tag to specify the NameNode location: "localhost" for pseudo-distributed mode. The NameNode runs at port 8020 by default if no port is specified.
88. Configuring mapred-site.xml (Pseudo-Distributed Mode)
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
Note: Add the "mapred.job.tracker" property under the configuration tag to specify the JobTracker location: "localhost:8021" for pseudo-distributed mode.
Lastly, set JAVA_HOME in conf/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
90. Starting HDFS
• Format the NameNode
  hadoop namenode -format
  – Creates an empty file system with storage directories and persistent data structures
  – Data nodes are not involved
• Start the dfs service
  start-dfs.sh
  – Verify daemons: jps
  – If you get the namespace exception, copy the namespace ID of the namenode and paste it into the /app/hadoop/tmp/dfs/data/current/VERSION file
  – Stop: stop-dfs.sh
• List/check HDFS
  hadoop fsck / -files -blocks
  hadoop fs -ls
  hadoop fs -mkdir testdir
  hadoop fs -ls
  hadoop fsck / -files -blocks
91. Verify HDFS
• Stop the dfs services
  stop-dfs.sh
• Verify daemons: jps
  – No Java processes should be running
92. Configuring HDFS: hdfs-site.xml (Pseudo-Distributed Mode)
• dfs.name.dir: directories for the NameNode to store its persistent data (comma-separated directory names); a copy of the metadata is stored in each listed directory. Default: ${hadoop.tmp.dir}/dfs/name
• dfs.data.dir: directories where the DataNode stores blocks; each block is stored in only one of these directories. Default: ${hadoop.tmp.dir}/dfs/data
• fs.checkpoint.dir: directories where the secondary namenode stores checkpoints; a copy of the checkpoint is stored in each listed directory. Default: ${hadoop.tmp.dir}/dfs/namesecondary
93. Basic HDFS Commands
• Creating a directory
  hadoop fs -mkdir <dirname>
• Removing a directory
  hadoop fs -rm <dirname>
• Copying files to HDFS from the local file system
  hadoop fs -put <local dir>/<filename> <hdfs dir name>/
• Copying files from HDFS to the local file system
  hadoop fs -get <hdfs dir name>/<hdfs file name> <local dir>/
• Listing files and directories
  hadoop fs -ls <dir name>
• Listing the blocks that make up each file in HDFS
  hadoop fsck / -files -blocks
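The same operations are also available programmatically. Below is a minimal sketch (not from the deck) using the org.apache.hadoop.fs.FileSystem API against the fs.default.name configured earlier; the class name HdfsCommandsExample and the file paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCommandsExample {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // hadoop fs -mkdir testdir
    fs.mkdirs(new Path("testdir"));

    // hadoop fs -put <local file> <hdfs dir>/
    fs.copyFromLocalFile(new Path("/home/hadoop/lab/data/pg5000.txt"),
                         new Path("testdir/pg5000.txt"));

    // hadoop fs -ls testdir
    for (FileStatus status : fs.listStatus(new Path("testdir"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }

    // hadoop fs -get <hdfs file> <local dir>/
    fs.copyToLocalFile(new Path("testdir/pg5000.txt"),
                       new Path("/home/hadoop/lab/data/pg5000-copy.txt"));
    fs.close();
  }
}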
94. HDFS Web UI
• Hadoop provides a web UI for viewing HDFS
– Available at http://namenode-host-ip:50070/
– Browse file system
– Log files
95. MapReduce
• A distributed parallel processing engine of Hadoop
• Processes the data in sequential parallel steps called
  • Map
  • Reduce
• Best run with a DFS supported by Hadoop to exploit its parallel processing abilities
• Has the ability to run on a cluster of computers
  • Each computer is called a node
• Input and output data at every stage is handled in terms of key/value pairs
  • Keys and values can be chosen by the programmer
• Mapper output is always sorted by key
• Mapper outputs with the same key are sent to the same reducer
• The number of mappers and reducers per node can be configured
97. The Overall MapReduce Word Count Process
[Diagram] Input → Splitting → Mapping → Shuffling → Reducing → Final Result
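As a concrete illustration of those stages (not from the deck; the input line is made up), a one-line word count would flow roughly like this:

Input:      "the cat sat on the mat"
Splitting:  one input split, read line by line
Mapping:    (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1)
Shuffling:  (cat,[1]) (mat,[1]) (on,[1]) (sat,[1]) (the,[1,1])   grouped and sorted by key
Reducing:   (cat,1) (mat,1) (on,1) (sat,1) (the,2)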
98. Design of MapReduce – Daemons
The MapReduce system is managed by two daemons
• JobTracker & TaskTracker
• JobTracker and TaskTracker function in master/slave fashion
  – JobTracker coordinates the entire job execution
  – TaskTracker runs the individual map and reduce tasks
  – JobTracker does the bookkeeping of all the tasks run on the cluster
  – One map task is created for each input split
  – Number of reduce tasks is configurable (mapred.reduce.tasks)
108. MapReduce Programming
• Having seen the architecture of MapReduce, to perform a job in Hadoop a programmer needs to create
  • A MAP function
  • A REDUCE function
  • A driver to communicate with the framework, and to configure and launch the job
109. Map Function
• The Map function is represented by the Mapper class, which declares an abstract method map()
• The Mapper class is a generic type with four type parameters for the input and output key/value pairs
  • Mapper<K1, V1, K2, V2>
  • K1, V1 are the types of the input key/value pair
  • K2, V2 are the types of the output key/value pair
• Hadoop provides its own types that are optimized for network serialization
  • Text corresponds to Java String
  • LongWritable corresponds to Java Long
  • IntWritable corresponds to Java Integer
• The map() method must be implemented to achieve the input key/value transformation
  • The map method is called by the MapReduce framework, passing the input key/values from the input file
  • The map method is provided with a context object to which the transformed key/values can be written
110. Mapper – Word Count
public static class TokenizerMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Tokenize the input line on whitespace and punctuation.
    StringTokenizer itr = new StringTokenizer(value.toString(), " \t\n\r\f,.;:?[]");
    while (itr.hasMoreTokens()) {
      // Emit (word, 1) for every token, lower-cased.
      word.set(itr.nextToken().toLowerCase());
      context.write(word, one);
    }
  }
}
111. Reduce Function
• The Reduce function is represented by the Reducer class, which declares an abstract method reduce()
• The Reducer class is a generic type with four type parameters for the input and output key/value pairs
  • Reducer<K2, V2, K3, V3>
  • K2, V2 are the types of the input key/value pair; these must match the output types of the Mapper
  • K3, V3 are the types of the output key/value pair
• The reduce() method must be implemented to achieve the desired transformation of the input key/value pairs
• The reduce method is called by the MapReduce framework, passing the input key/values coming out of the map phase
• The MapReduce framework guarantees that the records with the same key from all the map tasks will reach a single reduce task
• Similar to map, the reduce method is provided with a context object to which the transformed key/values can be written
112. Reducer – Word Count
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Sum all the counts emitted for this word.
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
113. MapReduce Job Driver – Word Count
public class WordCount {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: WordCount <input path> <output path>");
      System.exit(-1);
    }
    Configuration conf = new Configuration();
    Job job = new Job(conf, "Word Count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
114. MapReduce Job – Word Count
• Copy wordcount.jar from the hard drive folder to the Ubuntu VM
  hadoop jar <path_to_jar>/wordcount.jar com.hadoop.WordCount <hdfs_input_dir>/pg5000.txt <hdfs_output_dir>
• The <hdfs_output_dir> must not exist
• To view the output directly:
  hadoop fs -cat ../../part-r-00000
• To copy the result to local:
  hadoop fs -get <part-r-*****>
115. The MapReduce Web UI
• Hadoop provides a web UI for viewing job information
• Available at http://jobtracker-host:50030/
• Follow a job's progress while it is running
• Find job statistics
• View job logs
• Task details
116. Combiner
• A combiner function helps to aggregate the map output before passing it on to the reduce function
• Reduces the intermediate data to be written to disk
• Reduces the data to be transferred over the network
• The combiner for a job is specified as
  job.setCombinerClass(<combiner class name>.class);
• A combiner is represented by the same interface as a Reducer
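For the word-count job above, the IntSumReducer from slide 112 can itself serve as the combiner, since summing partial counts is associative and commutative. A minimal sketch (not from the deck) of the extra driver line, slotted into the slide-113 driver:

// Excerpt from the word-count driver (slide 113), with a combiner added.
// Partial sums are computed on each mapper's output before the shuffle,
// cutting the data written to disk and sent over the network.
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);   // the added line
job.setReducerClass(IntSumReducer.class);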
121. Partitioning
• Map tasks partition their output keys by the number of reducers
• There can be many keys in a partition
• All records for a given key will be in a single partition
• A Partitioner class controls partitioning based on the key
• Hadoop uses hash partitioning by default (HashPartitioner)
• The default behavior can be changed by implementing the getPartition() method of the (abstract) Partitioner class
  public abstract class Partitioner<KEY, VALUE> {
    public abstract int getPartition(KEY key, VALUE value, int numPartitions);
  }
• A custom partitioner for a job can be set as
  job.setPartitionerClass(<customPartitionerClass>.class);
123. Partitioner Example
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String ch = key.toString().substring(0, 1);
    /* Route words starting with a-m to partition 0, n-z to partition 1, others to 2:
    if (ch.matches("[abcdefghijklm]")) {
      return 0;
    } else if (ch.matches("[nopqrstuvwxyz]")) {
      return 1;
    }
    return 2; */
    // return (ch.charAt(0) % numPartitions); // round robin based on ASCII value
    return 0; // default behavior
  }
}
124. Reducer Progress During the Map Phase
• A MapReduce job shows something like Map (50%) Reduce (10%)
• Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes into account this data transfer, which is done by the reduce process, so reduce progress starts showing up as soon as any intermediate key-value pair from a mapper is available to be transferred to a reducer.
• The programmer-defined reduce method is called only after all the mappers have finished.
125. Hadoop Streaming
• Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
• Using the streaming system you can develop working Hadoop jobs with extremely limited knowledge of Java.
• Hadoop actually opens a process and writes to and reads from its stdin and stdout.
126. Hadoop Streaming
• Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.
• Streaming is naturally suited for text processing.
127. Hands on
• Folder: examples/Hadoop_streaming_python
  – Files “url1” & “url2” are the input
  – multifetch.py is the mapper (open it)
  – reducer.py is the reducer (open this as well)
129. Decomposing problems into M/R jobs
• Small MapReduce jobs are usually better
  – Easier to implement, test & maintain
  – Easier to scale & reuse
• Problem: find the word/letter that has the maximum occurrences in a set of documents
130. Decomposing
• Count of each word/letter: M/R job (Job 1)
• Find the max word/letter count: M/R job (Job 2)
• Choices can depend on the complexity of the jobs (a sketch of Job 2 follows)
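A minimal sketch (not from the deck) of what Job 2 could look like, assuming Job 1 is the word-count job from slides 110-113 and writes text lines of the form "word<TAB>count"; the class names MaxCount, MaxMapper and MaxReducer are illustrative, and the job is assumed to run with a single reducer so that one reduce task sees every count.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxCount {

  // Every record is mapped to the same key so a single reducer sees all counts.
  public static class MaxMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Text ALL = new Text("max");
    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(ALL, value);           // value is "word<TAB>count"
    }
  }

  // Scan all "word<TAB>count" pairs and keep the one with the largest count.
  public static class MaxReducer extends Reducer<Text, Text, Text, LongWritable> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      String bestWord = null;
      long bestCount = Long.MIN_VALUE;
      for (Text value : values) {
        String[] parts = value.toString().split("\t");
        long count = Long.parseLong(parts[1]);
        if (count > bestCount) {
          bestCount = count;
          bestWord = parts[0];
        }
      }
      if (bestWord != null) {
        context.write(new Text(bestWord), new LongWritable(bestCount));
      }
    }
  }
}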
131. Job Chaining
• Multiple jobs can be run in a linear or complex dependent fashion
  – Simple dependency / linear chain
  – Directed acyclic graph (DAG)
• The simple way is to call the job drivers one after the other with their respective configurations
  JobClient.runJob(conf1);
  JobClient.runJob(conf2);
• If a job fails, the runJob() method will throw an IOException, so later jobs in the pipeline don’t get executed.
132. Job Chaining
For complex dependencies you can use the JobControl and ControlledJob classes
  ControlledJob cjob1 = new ControlledJob(conf1);
  ControlledJob cjob2 = new ControlledJob(conf2);
  cjob2.addDependingJob(cjob1);
  JobControl jc = new JobControl("Chained Job");
  jc.addJob(cjob1);
  jc.addJob(cjob2);
  jc.run();
133. Apache Oozie
• Workflow scheduler for Hadoop
• Manages Hadoop jobs
• Integrated with many Hadoop apps, e.g. Pig
• Scalable
• Schedules jobs
• A workflow is a collection of actions, e.g.
  – map/reduce, pig
• A workflow is
  – Arranged as a DAG (directed acyclic graph)
  – Stored as hPDL (an XML process definition)
134. Oozie
• Engine to build complex DAG workflows
• Runs in its own daemon
• Workflows are described in a set of XML and configuration files
• Has a coordinator engine that schedules workflows based on time & incoming data
• Provides the ability to re-run failed portions of a workflow
135. Need for High-Level Languages
• Hadoop is great for large-data processing!
  – But writing Java programs for everything is verbose and slow
  – Not everyone wants to (or can) write Java code
• Solution: develop higher-level data processing languages
  – Hive: HQL is like SQL
  – Pig: Pig Latin is a bit like Perl
136. Hive and Pig
• Hive: data warehousing application in Hadoop
  – Query language is HQL, a variant of SQL
  – Tables stored on HDFS as flat files
  – Developed by Facebook, now open source
• Pig: large-scale data processing system
  – Scripts are written in Pig Latin, a dataflow language
  – Developed by Yahoo!, now open source
  – Roughly 1/3 of all Yahoo! internal jobs
• Common idea:
  – Provide a higher-level language to facilitate large-data processing
  – The higher-level language “compiles down” to Hadoop jobs
137. Hive: Background
• Started at Facebook
• Data was collected by nightly cron jobs into an Oracle DB
• “ETL” via hand-coded Python
• Grew from 10s of GBs (2006) to 1 TB/day of new data (2007), now 10x that
Source: cc-licensed slide by Cloudera
138. Hive Components
• Shell: allows interactive queries
• Driver: session handles, fetch, execute
• Compiler: parse, plan, optimize
• Execution engine: DAG of stages (MR, HDFS, metadata)
• Metastore: schema, location in HDFS, SerDe
Source: cc-licensed slide by Cloudera
139. Data Model
• Tables
  – Typed columns (int, float, string, boolean)
  – Also list and map (for JSON-like data)
• Partitions
  – For example, range-partition tables by date
• Buckets
  – Hash partitions within ranges (useful for sampling, join optimization)
Source: cc-licensed slide by Cloudera
140. Metastore
• Database: namespace containing a set of tables
• Holds table definitions (column types, physical layout)
• Holds partitioning information
• Can be stored in Derby, MySQL, and many other relational databases
Source: cc-licensed slide by Cloudera
141. Physical Layout
• Warehouse directory in HDFS
  – E.g., /user/hive/warehouse
• Tables stored in subdirectories of the warehouse
  – Partitions form subdirectories of tables
• Actual data stored in flat files
  – Control-character-delimited text, or SequenceFiles
  – With a custom SerDe, can use an arbitrary format
Source: cc-licensed slide by Cloudera
142. Hive: Example
• Hive looks similar to an SQL database
• Relational join on two tables:
  – Table of word counts from the Shakespeare collection
  – Table of word counts from the Bible
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 10;

the   25848   62394
I     23031    8854
and   19671   38985
to    18038   13526
of    16700   34654
a     14170    8057
you   12702    2720
my    11297    4135
in    10797   12445
is     8882    6884
143. Hive: Behind the Scenes
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 10;
(Abstract Syntax Tree)
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF shakespeare s) (TOK_TABREF bible k) (= (. (TOK_TABLE_OR_COL s) word) (. (TOK_TABLE_OR_COL k) word)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL s) word)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL s) freq)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL k) freq))) (TOK_WHERE (AND (>= (. (TOK_TABLE_OR_COL s) freq) 1) (>= (. (TOK_TABLE_OR_COL k) freq) 1))) (TOK_ORDERBY (TOK_TABSORTCOLNAMEDESC (. (TOK_TABLE_OR_COL s) freq))) (TOK_LIMIT 10)))
(compiles to one or more MapReduce jobs)
144. Hive: Behind the Scenes
(EXPLAIN plan, abridged; the original two-column slide layout is not fully recoverable)
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 is a root stage
STAGE PLANS:
  Stage-1 (Map Reduce): TableScan over shakespeare (alias s) and bible (alias k), each followed by a Filter Operator (freq >= 1) and a Reduce Output Operator keyed on word; the Reduce Operator Tree performs the inner join, filters, selects _col0/_col1/_col2, and writes an intermediate file (TextInputFormat / HiveIgnoreKeyTextOutputFormat).
  Stage-2 (Map Reduce): reads the intermediate output, sorts by the frequency column, and applies the limit via Extract / Limit / File Output operators (SequenceFileInputFormat / HiveSequenceFileOutputFormat).
  Stage-0 (Fetch Operator): limit: 10
145. Example Data Analysis Task
Find users who tend to visit “good” pages.

Visits (user, url, time):
  Amy   www.cnn.com        8:00
  Amy   www.crap.com       8:05
  Amy   www.myblog.com    10:00
  Amy   www.flickr.com    10:05
  Fred  cnn.com/index.htm 12:00
  ...

Pages (url, pagerank):
  www.cnn.com     0.9
  www.flickr.com  0.9
  www.myblog.com  0.7
  www.crap.com    0.2
148. MapReduce Code
A hand-written MapReduce implementation of a comparable analysis (find the top 100 sites visited by users aged 18 to 25): an MRExample class defining LoadPages, LoadAndFilterUsers, Join, LoadJoined, ReduceUrls, LoadClicks and LimitClicks mapper/reducer classes, plus a main() that chains the individual jobs ("Load Pages", "Load and Filter Users", "Join Users and Pages", "Group URLs", "Top 100 sites") together with JobConf, Job and JobControl. The full listing runs to well over a hundred lines of Java boilerplate, in contrast to the few lines of Pig Latin on the next slide.
Pig Slides adapted from Olston et al.
149. Pig Latin Script
Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate user, AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > '0.5';
store GoodUsers into '/data/good_users';
150. HIVE
• A data warehousing framework built on top of Hadoop
• Started at Facebook in 2006
• Target users are data analysts comfortable with SQL
• Allows the data to be queried with a SQL-like language called HiveQL
• Queries are compiled into MR jobs that are executed on Hadoop
• Meant for structured data
152. Hive Architecture cont.
• Can interact with Hive using:
  • CLI (Command Line Interface)
  • JDBC
  • Web GUI
• Metastore – Stores the system catalog and metadata about tables, columns, partitions etc.
• Driver – Manages the lifecycle of a HiveQL statement as it moves through Hive.
• Query Compiler – Compiles HiveQL into a directed acyclic graph of map/reduce tasks.
• Execution Engine – Executes the tasks produced by the compiler, interacting with the underlying Hadoop instance.
• HiveServer – Provides a Thrift interface and a JDBC/ODBC server.
153. Hive Architecture cont.
• Physical Layout
• Warehouse directory in HDFS
• e.g., /user/hive/warehouse
• Table row data stored in subdirectories of warehouse
• Partitions form subdirectories of table directories
• Actual data stored in flat files
• Control char-delimited text, or SequenceFiles
154. Hive Vs RDBMS
• Latency for Hive queries is generally very high (minutes), even when the data sets involved are very small.
• On RDBMSs, analyses proceed much more iteratively, with response times between iterations of less than a few minutes.
• Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets, or test queries.
• Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates.
• It is best used for batch jobs over large sets of immutable data (like web logs).
155. Supported Data Types
• Integers
  • BIGINT (8 bytes), INT (4 bytes), SMALLINT (2 bytes), TINYINT (1 byte)
  • All integer types are signed
• Floating point numbers
  • FLOAT (single precision), DOUBLE (double precision)
• STRING : sequence of characters
• BOOLEAN : True/False
• Hive also natively supports the following complex types:
  • Associative arrays – map<key-type, value-type>
  • Lists – list<element-type>
  • Structs – struct<field-name: field-type, ... >
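As a brief illustrative sketch (the table and column names here are made up, not from the slides), a table mixing primitive and complex types could be declared as follows; note that in HiveQL DDL the list type is written ARRAY:

CREATE TABLE employees (
  name          STRING,
  salary        FLOAT,
  subordinates  ARRAY<STRING>,                                  -- a list of names
  deductions    MAP<STRING, FLOAT>,                             -- deduction name -> amount
  address       STRUCT<street:STRING, city:STRING, zip:INT>     -- a struct with named fields
);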
156. Hive : Install & Configure
• Download a Hive release compatible with your Hadoop installation from:
  • http://hive.apache.org/releases.html
• Untar into a directory. This becomes Hive's home directory
  • tar xvzf hive-x.y.z.tar.gz
• Configure
  • Environment variables – add in .bash_profile
    • export HIVE_INSTALL=/<parent_dir_path>/hive-x.y.z
    • export PATH=$PATH:$HIVE_INSTALL/bin
• Verify Installation
  • Type : hive --help (Displays commands usage)
  • Type : hive (Enter the hive shell)
    hive>
157. Hive : Install & Configure cont.
• Start Hadoop daemons (Hadoop needs to be running)
• Configure to point to Hadoop
  • Create hive-site.xml in the $HIVE_INSTALL/conf directory
  • Specify the filesystem and jobtracker using the properties fs.default.name & mapred.job.tracker
  • If not set, these default to the local filesystem and the local (in-process) job runner
• Create the following directories under HDFS
  • /tmp (execute: hadoop fs -mkdir /tmp)
  • /user/hive/warehouse (execute: hadoop fs -mkdir /user/hive/warehouse)
  • chmod g+w for both (execute: hadoop fs -chmod g+w <dir_path>)
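A minimal hive-site.xml along these lines would do it (the host names and ports below are placeholders for your own cluster):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>   <!-- placeholder namenode address -->
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>          <!-- placeholder jobtracker address -->
  </property>
</configuration>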
158. Hive : Install & Configure cont.
• Data Store
  • Hive stores data under /user/hive/warehouse by default
• Metastore
  • Hive by default comes with a lightweight SQL database, Derby, to store the metastore metadata.
  • But this can be configured to use other databases, like MySQL, as well.
• Logging
  • Hive uses Log4j
  • You can find Hive's error log on the local file system at /tmp/$USER/hive.log
159. Hive Data Models
• Databases
• Tables
• Partitions
  • Each Table can have one or more partition keys which determine how the data is stored.
  • Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy a certain criteria.
  • For example, a date_partition of type STRING and country_partition of type STRING.
  • Each unique value of the partition keys defines a partition of the Table. For example, all "India" data from "2013-05-21" is a partition of the page_views table.
  • Therefore, if you run analysis on only the "India" data for 2013-05-21, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly.
160. Partition example
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);
• When we load data into a partitioned table, the partition
values are specified explicitly:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs
PARTITION (dt='2001-01-01', country='GB');
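Further loads target other partitions in the same way, and the partitions created so far can be listed from the shell (file2 and the second date below are illustrative):

LOAD DATA LOCAL INPATH 'input/hive/partitions/file2'
INTO TABLE logs
PARTITION (dt='2001-01-02', country='GB');

hive> SHOW PARTITIONS logs;
dt=2001-01-01/country=GB
dt=2001-01-02/country=GB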
161. Hive Data Model cont.
• Buckets
• Data in each partition may in turn be divided into
Buckets based on the value of a hash function of some
column of the Table.
• For example the page_views table may be bucketed
by userid, which is one of the columns, other than the
partitions columns, of the page_views table. These
can be used to efficiently sample the data.
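A sketch of how this might look for the page_views table (columns other than userid are made up; the bucket count of 32 is arbitrary); bucketing is declared at table creation time and can then be used for sampling:

CREATE TABLE page_views (viewTime INT, userid BIGINT, page_url STRING)
PARTITIONED BY (dt STRING, country STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;

-- read roughly 1/32 of the data by sampling a single bucket
SELECT * FROM page_views TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid);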
162. A Practical session
Starting the Hive CLI
• Start a terminal and run:
  • $ hive
• Will take you to the hive shell/prompt
  hive>
• Set a Hive or Hadoop conf prop:
  • hive> set propkey=propvalue;
• List all properties and their values:
  • hive> set -v;
163. A Practical Session
Hive CLI Commands
• List tables:
– hive> show tables;
• Describe a table:
– hive> describe <tablename>;
• More information:
– hive> describe extended <tablename>;
164. A Practical Session
Hive CLI Commands
• Create tables:
  – hive> CREATE TABLE cite (citing INT, cited INT)
       > ROW FORMAT DELIMITED
       > FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
• The 2nd and the 3rd lines tell Hive how the data is stored (as a text file) and how it should be parsed (fields are separated by commas).
• Loading data into tables
  • Let’s load the patent data into table cite
    hive> LOAD DATA LOCAL INPATH '<path_to_file>/cite75_99.txt'
        > OVERWRITE INTO TABLE cite;
• Browse data
  hive> SELECT * FROM cite LIMIT 10;
165. A Practical Session
Hive CLI Commands
• Count
  hive> SELECT COUNT(*) FROM cite;
Some more playing around
• Create table to store citation frequency of each patent
  hive> CREATE TABLE cite_count (cited INT, count INT);
• Execute the query on the previous table and store the results:
  hive> INSERT OVERWRITE TABLE cite_count
      > SELECT cited, COUNT(citing)
      > FROM cite
      > GROUP BY cited;
• Query the count table
  hive> SELECT * FROM cite_count WHERE count > 10 LIMIT 10;
• Drop Table
  hive> DROP TABLE cite_count;
166. Data Model
Partitioning Data
• One or more partition columns may be specified:
  hive> CREATE TABLE tbl1 (id INT, msg STRING)
      > PARTITIONED BY (dt STRING);
• Creates a subdirectory for each value of the partition column, e.g.:
  /user/hive/warehouse/tbl1/dt=2009-03-20/
• Queries with partition columns in the WHERE clause will scan through only a subset of the data.
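For instance, with tbl1 defined as above, a query like the following would only read the dt=2009-03-20 subdirectory rather than the whole table:

hive> SELECT id, msg
    > FROM tbl1
    > WHERE dt = '2009-03-20';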
167. Managing Hive Tables
• Managed table
• Default table created (without EXTERNAL keyword)
• Hive manages the data
CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO table managed_table;
• Moves the data into the warehouse directory for the table.
DROP TABLE managed_table;
• Deletes table data & metadata.
168. Managing Hive Tables
• External table
  • You control the creation and deletion of the data.
  • The location of the external data is specified at table creation time:
    CREATE EXTERNAL TABLE external_table (dummy STRING)
    LOCATION '/user/tom/external_table';
    LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
  • DROP TABLE external_table;
    Hive will leave the data untouched and only delete the metadata.
169. Conclusion
• Supports rapid iteration of ad-hoc queries
• High-level interface (HiveQL) to low-level infrastructure (Hadoop)
• Scales to handle much more data than many similar systems
173. PIG
• PIG is an abstraction layer on top of MapReduce that frees analysts from the complexity of MapReduce programming
• Architected towards handling unstructured and semi-structured data
• It's a dataflow language, which means the data is processed in a sequence of steps, each transforming the data
• The transformations support relational-style operations such as filter, union, group, and join
• Designed to be extensible and reusable
  • Programmers can develop and use their own functions (UDFs)
• Programmer friendly
  • Allows to introspect data structures
  • Can do a sample run on a representative subset of your input
• PIG internally converts each transformation into a MapReduce job and submits it to the hadoop cluster
• 40 percent of Yahoo's Hadoop jobs are run with PIG
174. Pig : What for ?
• An ad-hoc way of creating and executing
map-reduce jobs on very large data sets
• Rapid development
• No Java is required
• Developed by Yahoo
176. Pig Use cases
Processing of Web Logs.
Data processing for search platforms.
Support for Ad Hoc queries across large
datasets.
Quick Prototyping of algorithms for
processing large datasets.
177. Use Case in Healthcare
Problem Statement:
De-identify personal health information.
Challenges:
Huge amount of data flows into the systems daily and there are multiple data
sources that we need to aggregate data from.
Crunching this huge data and deidentifying it in a traditional way had problems.
179. When to Not Use PIG
• Really nasty data formats or completely unstructured data (video, audio, raw human-readable text).
• Pig is definitely slower than hand-tuned MapReduce jobs.
• When you would like more power to optimize your code.
180. PIG Architecture
• Pig runs as a client-side application; there is no need to install anything on the cluster.
181. Install and Configure PIG
• Download a version of PIG compatible with your hadoop installation
  • http://pig.apache.org/releases.html
• Untar into a designated folder. This will be Pig's home directory
  • tar xzf pig-x.y.z.tar.gz
• Configure
  • Environment variables – add in .bash_profile
    • export PIG_INSTALL=/<parent directory path>/pig-x.y.z
    • export PATH=$PATH:$PIG_INSTALL/bin
• Verify Installation
  • Try pig -help
    • Displays command usage
  • Try pig
    • Takes you into the Grunt shell: grunt>
182. PIG Execution Modes
• Local Mode
  • Runs in a single JVM
  • Operates on the local file system
  • Suitable for small datasets and for development
  • To run PIG in local mode:
    pig -x local
• MapReduce Mode
  • In this mode the queries are translated into MapReduce jobs and run on a hadoop cluster
  • PIG version must be compatible with the hadoop version
  • Set the HADOOP_HOME environment variable to tell Pig which hadoop client to use:
    export HADOOP_HOME=$HADOOP_INSTALL
  • If not set, it will use a bundled version of hadoop
187. Ways of Executing PIG programs
• Grunt
  • An interactive shell for running Pig commands
  • Grunt is started when the pig command is run without any options
• Script
  • Pig commands can be executed directly from a script file:
    pig pigscript.pig
  • It is also possible to run Pig scripts from the Grunt shell using run and exec.
• Embedded
  • You can run Pig programs from Java using the PigServer class, much like you can use JDBC
  • For programmatic access to Grunt, use PigRunner
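A minimal embedded sketch (assuming the Pig jars are on the classpath; the file name, statements and output path below are placeholders):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // Run against the local filesystem; use ExecType.MAPREDUCE for a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Register Pig Latin statements just as you would type them in Grunt.
        pig.registerQuery("records = LOAD 'sample.txt' AS (year:chararray, temperature:int, quality:int);");
        pig.registerQuery("filtered = FILTER records BY temperature != 9999;");
        // Materialise the relation into an output directory.
        pig.store("filtered", "filtered_out");
    }
}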
188. An Example
Create a file sample.txt with (tab delimited):
1932    23    2
1905    12    1
And so on.
grunt> records = LOAD '<your_input_dir>/sample.txt'
       AS (year:chararray, temperature:int, quality:int);
grunt> DUMP records;
grunt> filtered_records = FILTER records BY temperature != 9999 AND
       (quality == 0 OR quality == 1 OR quality == 4);
189. Example cont.
grunt> grouped_records = GROUP filtered_records BY year;
grunt> max_temp = FOREACH grouped_records GENERATE group,
       MAX(filtered_records.temperature);
grunt> DUMP max_temp;
191. Data Types
• Simple Types
Category   Type        Description
Numeric    int         32-bit signed integer
           long        64-bit signed integer
           float       32-bit floating-point number
           double      64-bit floating-point number
Text       chararray   Character array in UTF-16 format
Binary     bytearray   Byte array
192. Data Types
• Complex Types
Type    Description                                                   Example
Tuple   Sequence of fields of any type                                (1, 'pomegranate')
Bag     An unordered collection of tuples, possibly with duplicates   {(1, 'pomegranate'), (2)}
Map     A set of key-value pairs; keys must be character arrays       ['a'#'pomegranate']
        but values may be any type
193. LOAD Operator
<relation name> = LOAD '<input file with path>' [USING UDF()]
[AS (<field name1>:dataType, <field name2>:dataType, <field name3>:dataType)]
• Loads data from a file into a relation
• Uses the PigStorage load function as default unless specified otherwise with the USING option
• The data can be given a schema using the AS option
• The default data type is bytearray if not specified
records = LOAD 'sales.txt';
records = LOAD 'sales.txt' AS (f1:chararray, f2:int, f3:float);
records = LOAD 'sales.txt' USING PigStorage('\t');
records = LOAD 'sales.txt' USING PigStorage('\t') AS (f1:chararray, f2:int, f3:float);
194. Diagnostic Operators
• DESCRIBE
• Describes the schema of a relation
• EXPLAIN
• Display the execution plan used to compute
a relation
• ILLUSTRATE
• Illustrate step-by-step how data is
transformed
• Uses sample of the input data to simulate
the execution.
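For example, against the relations built earlier in this deck (records and max_temp), the diagnostic operators are invoked straight from the Grunt shell:

grunt> DESCRIBE records;
grunt> EXPLAIN max_temp;
grunt> ILLUSTRATE max_temp;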
195. Data Write Operators
• LIMIT
• Limits the number of tuples from a relation
• DUMP
• Display the tuples from a relation
• STORE
• Store the data from a relation into a
directory.
• The directory must not exist
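A small illustrative sequence using the earlier max_temp relation (the output path is a placeholder):

grunt> top_rows = LIMIT max_temp 5;
grunt> DUMP top_rows;
grunt> STORE max_temp INTO '/output/max_temp_out' USING PigStorage(',');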
196. Relational Operators
• FILTER
  • Selects tuples based on a Boolean expression
    teenagers = FILTER cust BY age < 20;
• ORDER
  • Sort a relation based on one or more fields
  • Further processing (FILTER, DISTINCT, etc.) may destroy the ordering
    ordered_list = ORDER cust BY name DESC;
• DISTINCT
  • Removes duplicate tuples
    unique_custlist = DISTINCT cust;
197. Relational Operators
• GROUP BY
  • Within a relation, group tuples with the same group key
  • GROUP ALL will group all tuples into one group
    groupByProfession = GROUP cust BY profession;
    groupEverything = GROUP cust ALL;
• FOR EACH
  • Loop through each tuple in nested_alias and generate new tuple(s)
    countByProfession = FOREACH groupByProfession GENERATE group, COUNT(cust);
  • Built-in aggregate functions: AVG, COUNT, MAX, MIN, SUM
198. Relational Operators
• GROUP BY
  • Within a relation, group tuples with the same group key
  • GROUP ALL will group all tuples into one group
    groupByProfession = GROUP cust BY profession;
    groupEverything = GROUP cust ALL;
• FOR EACH
  • Loop through each tuple in nested_alias and generate new tuple(s)
  • At least one of the fields of nested_alias should be a bag
  • DISTINCT, FILTER, LIMIT, ORDER, and SAMPLE are allowed operations in the nested op to operate on the inner bag(s)
    countByProfession = FOREACH groupByProfession GENERATE group, COUNT(cust);
  • Built-in aggregate functions: AVG, COUNT, MAX, MIN, SUM
199. Operating on Multiple datasets
• Join
Compute inner join of two or more relations based on common fields values
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(7,9)
X = JOIN A BY a1, B BY b1;
DUMP X;
(1,2,3,1,3)
(8,3,4,8,9)
(7,2,5,7,9)
200. Operating on Multiple datasets
• COGROUP
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
Group tuples from two or more relations,based on common group values
DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(7,9)
X = COGROUP A BY a1, B BY b1;
DUMP X;
(1,{(1,2,3)},{(1,3)})
(8,{(8,3,4)},{(8,9)})
(7,{(7,2,5)},{(7,9)})
(2,{},{(2,4),(2,7)})
(4,{(4,2,1),(4,3,3)},{})
201. Joins & Cogroups
• JOIN and COGROUP operators perform similar functions.
• JOIN creates a flat set of output records while COGROUP creates
a nested set of output records
202. Data
File – student                    File – studentRoll
Name    Age    GPA                Name    RollNo
Joe     18     2.5                Joe     45
Sam            3.0                Sam     24
Angel   21     7.9                Angel   1
John    17     9.0                John    12
Joe     19     2.9                Joe     19
203. Pig Latin - GROUP Operator
Example of GROUP Operator:
A = load 'student' as (name:chararray, age:int, gpa:float);
dump A;
(joe,18,2.5)
(sam,,3.0)
(angel,21,7.9)
(john,17,9.0)
(joe,19,2.9)
X = group A by name;
dump X;
(joe,{(joe,18,2.5),(joe,19,2.9)})
(sam,{(sam,,3.0)})
(john,{(john,17,9.0)})
(angel,{(angel,21,7.9)})
204. Pig Latin – COGROUP Operator
Example of COGROUP Operator:
A = load 'student' as (name:chararray, age:int,gpa:float);
B = load 'studentRoll' as (name:chararray, rollno:int);
X = cogroup A by name, B by name;
dump X;
(joe,{(joe,18,2.5),(joe,19,2.9)},{(joe,45),(joe,19)})
(sam,{(sam,,3.0)},{(sam,24)})
(john,{(john,17,9.0)},{(john,12)})
(angel,{(angel,21,7.9)},{(angel,1)})
205. Operating on Multiple datasets
UNION
Creates the union of two or more relations
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
DUMP B;
(2,4)
(8,9)
X = UNION A, B;
DUMP X;
(1,2,3)
(4,2,1)
(8,3,4)
(2,4)
(8,9)
206. Operating on Multiple datasets
SPLIT
Splits a relation into two or more relations, based on a Boolean expression
SPLIT X INTO C IF a1 < 5, D IF a1 > 5;
DUMP C;
(1,2,3)
(4,2,1)
(2,4)
DUMP D;
(8,3,4)
(8,9)
207. User Defined Functions (UDFs)
• PIG lets users define their own functions and lets them be used in the statements
• The UDFs can be developed in Java, Python or Javascript
– Filter UDF
  • To be a subclass of FilterFunc, which is a subclass of EvalFunc
– Eval UDF
  • To be a subclass of EvalFunc
    public abstract class EvalFunc<T> {
        public abstract T exec(Tuple input) throws IOException;
    }
– Load / Store UDF
  • To be a subclass of LoadFunc / StoreFunc
208. Creating UDF : Eval function example
public class UPPER extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();   // java upper case function
        } catch (Exception e) {
            throw new IOException("Caught exception processing input...", e);
        }
    }
}
209. Define and use an UDF
• Package the UDF class into a jar
• Define and use a UDF
  REGISTER yourUDFjar_name.jar;
  cust = LOAD 'some_cust_data';
  filtered = FOREACH cust GENERATE com.pkg.UPPER(title);
  DUMP filtered;
210. Piggy Bank
• Piggy Bank is a place for Pig users to share the
Java UDFs they have written for use with Pig.
The functions are contributed "as-is".
• Piggy Bank currently supports Java UDFs.
• No binary version to download, download
source, build and use.
211. Pig Vs. Hive
• Hive was invented at Facebook.
• Pig was invented at Yahoo.
• If you know SQL, then Hive will be very
familiar to you. Since Hive uses SQL, you will
feel at home with all the
familiar select, where, group by, and order
by clauses similar to SQL for relational
databases.
212. Pig Vs. Hive
• Pig needs some mental adjustment for SQL
users to learn.
• Pig Latin has many of the usual data
processing concepts that SQL has, such as
filtering, selecting, grouping, and ordering, but
the syntax is a little different from SQL
(particularly the group
by and flatten statements!).
213. Pig Vs. Hive
• Pig gives you more control and optimization over
the flow of the data than Hive does.
• If you are a data engineer, then you’ll likely feel
like you’ll have better control over the dataflow
(ETL) processes when you use Pig Latin, if you
come from a procedural language background.
• If you are a data analyst, however, you will likely
find that you can work on Hadoop faster by using
Hive, if your previous experience was more with
SQL.
214. Pig Vs. Hive
• Pig Latin allows users to store data at
any point in the pipeline without disrupting
the pipeline execution.
• Pig is not meant to be an ad-hoc query tool.
217. What is Apache Mahout
• Mahout is an open source machine learning
library from Apache.
• Scalable, can run on Hadoop
• Written in Java
• Started as a Lucene sub-project, became an
Apache top-level-project in 2010.
218. Machine Learning : Definition
• “Machine Learning is programming computers to
optimize a performance criterion using example data or
past experience”
– Intro. To Machine Learning by E. Alpaydin
• Subset of Artificial Intelligence
• Lots of related fields:
– Information Retrieval
– Stats
– Biology
– Linear algebra
– Many more
219. Machine Learning
• Machine learning, a branch of artificial intelligence, concerns the
construction and study of systems that can learn from data.
• Supervised learning is tasked with learning a function from labeled
training data in order to predict the value of any valid input.
– Common examples of supervised learning include classifying e-mail messages as
spam.
• Unsupervised learning is tasked with making sense of data without any
examples of what is correct or incorrect. It is most commonly used for
clustering similar input into logical groups.
220. Common Use Cases
• Recommend friends/dates/products
• Classify content into predefined groups
• Find similar content
• Find associations/patterns in actions/behaviors
• Identify key topics/summarize text
  – Documents and Corpora
• Detect anomalies/fraud
• Ranking search results
• Others?
221. Apache Mahout
• An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
  – http://mahout.apache.org
• Why Mahout?
  – Many Open Source ML libraries either:
    • Lack Community
    • Lack Documentation and Examples
    • Lack Scalability
    • Lack the Apache License
    • Or are research-oriented
225. Key points
• Automatic discovery of the most relevant content
without searching for it
• Automatic discovery and recommendation of the
most appropriate connection between people and
interests
• Personalization and presentation of the most
relevant content (content, inspirations,
marketplace, ads) at every page/touch point
226. Business objectives
• More revenue from incoming traffic
• Improved consumer interaction and loyalty, as
each page has more interesting and relevant
content
• Drives better utilization of assets within the
platform by linking “similar” products
227. Basic approaches to Recommendations
on social sites
• Collaborative Filtering (CF): Collaborative filtering first
computes similarity between two users based on their
preference towards items, and recommends items
which are highly rated (preferred) by similar users.
• Content based Recommendation(CBR) : Content
based system provides recommendation directly based
on similarity of items and the user history . Similarity is
computed based on item attributes using appropriate
distance measures.
228. Collaborative Filtering (CF)
– Provide recommendations solely based on preferences
expressed between users and items
– “People who watched this also watched that”
• Content-based Recommendations (CBR)
– Provide recommendations based on the attributes of the
items (and user profile)
– ‘Chak de India’ is a sports movie, Mohan likes sports
movies
• => Suggest Chak de India to Mohan
• Mahout geared towards CF, can be extended to do CBR
• Aside: search engines can also solve these problems
232. Component diagram
(Diagram) Components: Core platform DB, QDSS (Cassandra), DataModel Extractor / User Preference Builder, Feeder, Recommender App, REST API, Personalization DB.
233. Key architecture blocks
• Data Extraction
– This involves extracting relevant data from :
• Core Platform DB (MySQL)
• Qyuki custom Data store (Cassandra)
• Data Mining :
– This involves a series of activities to calculate
distances/similarities between items, preferences of users
and applying Machine Learning for further use cases.
• Personalization Service:
– The application of relevant data contextually for the users
and creations on Qyuki, exposed as a REST api.
234. Entities to Recommend : Creations
For every creation, there are 2 parts to recommendations:
1) Creations similar to the current creation (Content-Based)
2) Recommendations based on the studied preferences of the user.
235. Features cont..
Entities to Recommend : Users
• New users/creators to follow based on his
social graph & other activities
• New users/creators to collaborate with.
– The engine would recommend artists to artists to
collaborate and create.
– QInfluence
236. Personalized Creations
• Track user’s activities
– MySQL + Custom Analytics(Cassandra)
• Explicit & Implicit Ratings
• Explicit : None in our case
• Implicit : Emotes, comments, views,
engagement with creator
237. Personalized Creations cont..
• Mahout expects data input in the form:
  User    Item    Preference
  U253    I306    3.5
  U279    I492    4.0
• Preference Compute Engine
  – Consider all user activities, infer preferences
• Feed preference data to Mahout, and you are done (to some extent)
• Filtering and rescoring
238. Some Mahout recommendations logic/code
• DataModel : preference data model
  – DataModel dataModel = new FileDataModel(new File(preferenceFile));
• Notion of similarity (distance measure)
  – UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
• Neighbourhood (for user based)
  – UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, dataModel);
• Recommender instance
  – GenericUserBasedRecommender userBasedRecommender = new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
• Recommendations for a user
  – topUserBasedRecommendedItems = userBasedRecommender.recommend(userId, numOfItemsToBeCached, null);
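Putting those pieces together, a self-contained sketch might look like this (the CSV path, user ID and the choice of 10 neighbours / 5 recommendations are placeholders; assumes the Mahout 0.x Taste API):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserBasedRecommenderSketch {
    public static void main(String[] args) throws Exception {
        // userID,itemID,preference triples, one per line
        DataModel model = new FileDataModel(new File("preferences.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 5 items predicted for user 253
        List<RecommendedItem> items = recommender.recommend(253L, 5);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}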
239. Similarity algorithms
• Pearson correlation–based similarity
  – Pearson correlation is a number between –1 and 1
  – It measures the tendency of two users’ preference values to move together, that is, to be relatively high, or relatively low, on the same items.
  – Doesn’t take into account the number of items in which two users’ preferences overlap
240. Similarity algorithms
• Cosine Similarity
  – Ignores 0-0 matches
  – A measure of alignment/direction
  – Since cos 0 = 1 (1 means 100% similar)
• Euclidean Distance Similarity
  – Based on the Euclidean distance between two points (square root of the sum of squares of differences in coordinates)
241. Which Similarity to use
• If your data is dense (almost all attributes have
non-zero values) and the magnitude of the
attribute values is important, use distance
measures such as Euclidean.
• If the data is subject to grade-inflation (different
users may be using different scales) use Pearson.
• If the data is sparse consider using Cosine
Similarity.
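In code this choice is just a different UserSimilarity implementation (a sketch using the standard Mahout 0.x classes from org.apache.mahout.cf.taste.impl.similarity, reusing the dataModel from slide 238):

// Dense data where attribute magnitudes matter
UserSimilarity euclidean = new EuclideanDistanceSimilarity(dataModel);
// Data subject to grade-inflation (users rate on different personal scales)
UserSimilarity pearson = new PearsonCorrelationSimilarity(dataModel);
// Sparse data
UserSimilarity cosine = new UncenteredCosineSimilarity(dataModel);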
242. Hybrid approaches
• Weighted hybrid – Combines scores from each
component using linear formula.
• Switching hybrid – Select one recommender
among candidates.
• Mixed hybrid – Based on the merging and
presentation of multiple ranked lists into one.
– Core algorithm of mixed hybrid merges them into a
single ranked list.
245. Movie recommendations Case
• Movie rating data (17 million ratings from
around 6000 users available as input)
• Goal : Compute recommendations for users
based on the rating data
• Use Mahout over Hadoop
• Mahout has support to run Map Reduce Jobs
• Runs a series of Map Reduce Jobs to compute
recommendations (Movies a user would like
to watch)
246. Run Mahout job on Hadoop
hadoop jar ~/mahout-core-0.7-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  --input /user/hadoop/input/ratings.csv \
  --output recooutput \
  -s SIMILARITY_COOCCURRENCE \
  --usersFile /user/hadoop/input/users.txt
247. Other similarity options
• You can pass one of these as well to the “-s” flag:
  • SIMILARITY_COOCCURRENCE, SIMILARITY_EUCLIDEAN_DISTANCE, SIMILARITY_LOGLIKELIHOOD, SIMILARITY_PEARSON_CORRELATION, SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_UNCENTERED_COSINE, SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE
248. Learning
• Reach out to experienced folks early
• Recommending follows the 90/10 rule
• Assign the right task to the right person
• Use Analytics metrics to evaluate recommenders periodically
• Importance to human evaluation
• Test thoroughly
• Mahout has some bugs (e.g. filtering part)
• Experiment with hybrid approaches
• You won’t need Hadoop generally
• Performance test your application
• Have a fallback option, hence take charge of your popularity algorithm
249. What is recommendation?
Recommendation involves the prediction of what
new items a user would like or dislike based on
preferences of or associations to previous items
(Made-up) Example:
A user, John Doe, likes the following books (items):
A Tale of Two Cities
The Great Gatsby
For Whom the Bell Tolls
Recommendations will predict which new books
(items), John Doe, will like:
Jane Eyre
The Adventures of Tom Sawyer
250. How does Mahout’s Recommendation Engine Work?
(Matrix illustration: S x U = R, with the example values detailed on the next three slides.)
S is the similarity matrix between items
U is the user’s preferences for items
R is the predicted recommendations
251. What is the similarity matrix, S?
          Item 1  Item 2  Item 3  Item 4  Item 5  Item 6  Item 7
Item 1      5       3       4       4       2       2       1
Item 2      3       3       3       2       2       1       0
Item 3      4       3       4       3       1       2       0
Item 4      4       2       3       4       2       2       1
Item 5      2       2       1       2       2       1       1
Item 6      2       1       2       2       1       2       0
Item 7      1       0       0       1       1       0       1
S is an n x n (square) matrix
Each element, e, in S is indexed by row (j) and column (k), ejk
Each ejk in S holds a value that describes how similar its corresponding j-th and k-th items are
In this example, the similarity of the j-th and k-th items is determined by the frequency of their co-occurrence (when the j-th item is seen, the k-th item is seen as well)
In general, any similarity measure may be used to produce these values
We see in this example that Items 1 and 2 co-occur 3 times, Items 1 and 3 co-occur 4 times, and so on…
252. What is the user’s preferences, U?
Item 1   2
Item 2   0
Item 3   0
Item 4   4
Item 5   4.5
Item 6   0
Item 7   5
The user’s preference is represented as a column vector
Each value in the vector represents the user’s preference for the j-th item
In general, this column vector is sparse
Values of zero, 0, represent no recorded preferences for the j-th item
253. What is the recommendation, R?
Item 1   40
Item 2   18.5
Item 3   24.5
Item 4   40
Item 5   26
Item 6   16.5
Item 7   15.5
R is a column vector representing the prediction of recommendation of the j-th item for the user
R is computed from the multiplication of S and U: S x U = R
In this running example, the user already has expressed positive preferences for Items 1, 4, 5 and 7, so we look at only Items 2, 3, and 6
We would recommend Items 3, 2, and 6, in this order, to the user
254. What data format does Mahout’s recommendation engine expect?
• For Mahout v0.7, look at RecommenderJob (org.apache.mahout.cf.taste.hadoop.item.RecommenderJob)
• Each line of the input file should have the following format:
  – userID,itemID[,preferencevalue]
• userID is parsed as a long
• itemID is parsed as a long
• preferencevalue is parsed as a double and is optional
Format 1:
123,345
123,456
123,789
…
789,458
Format 2:
123,345,1.0
123,456,2.2
123,789,3.4
…
789,458,1.2
255. How do you run Mahout’s recommendation engine?
Requirements:
• Hadoop cluster on GNU/Linux
• Java 1.6.x
• SSH
Assuming you have a Hadoop cluster installed and configured correctly with the data loaded into HDFS:
hadoop jar ~/mahout-core-0.7-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  --input /user/hadoop/input/ratings.csv \
  --output recooutput \
  -s SIMILARITY_COOCCURRENCE \
  --usersFile /user/hadoop/input/users.txt
$HADOOP_INSTALL$ is the location where you installed Hadoop
$TARGET$ is the directory where you have the Mahout jar file
$INPUT$ is the input file name
$OUTPUT$ is the output file name
256. Running Mahout RecommenderJob options
• There are plenty of runtime options (check the javadocs)
• --usersFile (path) : optional; a file containing userIDs; only preferences of these userIDs will be computed
• --itemsFile (path) : optional; a file containing itemIDs; only these items will be used in the recommendation predictions
• --numRecommendations (integer) : number of recommendations to compute per user; default 10
• --booleanData (boolean) : treat input data as having no preference values; default false
• --maxPrefsPerUser (integer) : maximum number of preferences considered per user in the final recommendation phase; default 10
• --similarityClassname (classname) : similarity measure (cooccurrence, euclidean, log-likelihood, pearson, tanimoto coefficient, uncentered cosine, cosine)
257. Coordination in a distributed system
• Coordination: An act that multiple nodes must
perform together.
• Examples:
– Group membership
– Locking
– Publisher/Subscriber
– Leader Election
– Synchronization
• Getting node coordination correct is very hard!
258. Introducing ZooKeeper
ZooKeeper allows distributed processes to
coordinate with each other through a shared
hierarchical name space of data registers.
- ZooKeeper Wiki
ZooKeeper is much more than a
distributed lock server!
259. What is ZooKeeper?
• An open source, high-performance
coordination service for distributed
applications.
• Exposes common services in simple interface:
– naming
– configuration management
– locks & synchronization
– group services
… developers don't have to write them from scratch
• Build your own on it for specific needs.
260. ZooKeeper Use Cases
• Configuration Management
  – Cluster member nodes bootstrapping configuration from a centralized source in an unattended way
  – Easier, simpler deployment/provisioning
• Distributed Cluster Management
  – Node join / leave
  – Node statuses in real time
• Naming service – e.g. DNS
• Distributed synchronization – locks, barriers, queues
• Leader election in a distributed system
• Centralized and highly reliable (simple) data registry
261. The ZooKeeper Service
• The ZooKeeper service is replicated over a set of machines
• All machines store a copy of the data (in memory)
• A leader is elected on service startup
• Clients only connect to a single ZooKeeper server and maintain a TCP connection
• A client can read from any ZooKeeper server; writes go through the leader and need majority consensus
Image: https://cwiki.apache.org/confluence/display/ZOOKEEPER/ProjectDescription
262. The ZooKeeper Data Model
• ZooKeeper has a hierarchical name space.
• Each node in the namespace is called a ZNode.
• Every ZNode has data (given as byte[]) and can optionally have children.
parent : "foo"
|-- child1 : "bar"
|-- child2 : "spam"
`-- child3 : "eggs"
    `-- grandchild1 : "42"
• ZNode paths:
  – canonical, absolute, slash-separated
  – no relative references
  – names can have Unicode characters
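As a rough sketch of what working with such a tree looks like from the standard Java client (the connection string, session timeout and paths below are placeholders):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZNodeSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (placeholder address, 3 s session timeout, no watcher)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, null);

        // Create persistent znodes holding small byte[] payloads
        zk.create("/parent", "foo".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/parent/child1", "bar".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read the data back (no watch, no Stat)
        byte[] data = zk.getData("/parent/child1", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}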