This document introduces distributed computing and tools for processing large tabular data using the Big Data Cluster. It discusses how distributed computing allows tabular data to be replicated across nodes and computation to be parallelized. It then provides an overview of Hadoop and how the Big Data Cluster can be used with tools like Hue, Hive, and Pig to perform analytics on large datasets. Finally, it walks through an example of computing TF-IDF scores on a corpus of text documents from Project Gutenberg.
OU RSE Tutorial - Big Data Cluster
1. Working with large tables:
processing and analytics with the Big Data Cluster
Enrico Daga
enrico.daga@open.ac.uk - @enridaga
Knowledge Media Institute - The Open University
http://isds.kmi.open.ac.uk/
OU Research Software Engineers - October 2018
2. Objective
• To introduce the concept of distributed computing
• To show how to use the Big Data Cluster
• To taste some tools for data processing
• To understand the differences from more traditional
approaches (e.g. a relational data warehouse)
5. Tabular data
Many different types of data objects are tables or can be translated and manipulated as data tables:
• Excel documents, relational databases -> tables
• Text documents -> word vectors -> tables
• Web data -> graph -> tables
• JSON -> tree -> graph -> tables (see the sketch below)
• …
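As an illustration of the last transformation, here is a minimal Python sketch, with a toy record invented for the example, that flattens a JSON tree into table columns:

import json

# A toy JSON document: a tree of nested objects and lists.
record = json.loads('{"id": 1, "tags": ["a", "b"], "author": {"name": "Ada"}}')

def flatten(value, prefix=""):
    """Walk the tree and emit (column, value) pairs."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}{key}.")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}{index}.")
    else:
        yield prefix.rstrip("."), value

print(dict(flatten(record)))
# {'id': 1, 'tags.0': 'a', 'tags.1': 'b', 'author.name': 'Ada'}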
6. Tables can be large
• Web server logs
  • Thousands of entries each day even for a small Web site, billions for a large one
• Social media
  • 500M tweets every day
• Search engines
  • Based on word/document statistics…
  • Google indexes contain hundreds of billions of documents
Many other cases:
• Stock exchange
• Black boxes
• Power grid
• Transport
• …
7. Tables can be large
• Most operations on tabular data require scanning all the rows in the table (see the sketch below):
  • Filter, Count, MIN, MAX, AVG, …
• One example: computing TF/IDF: https://en.wikipedia.org/wiki/Tf-idf
“In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.”
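To make the cost concrete, here is a minimal single-machine Python sketch (rows.tsv and its column layout are hypothetical) showing that COUNT, MIN, MAX and AVG each need a full pass over the rows:

import csv

count, total = 0, 0.0
minimum, maximum = float("inf"), float("-inf")

# Every aggregate below is computed in one streaming scan:
# no row can be skipped without risking a wrong answer.
with open("rows.tsv") as f:  # hypothetical tab-separated table
    for row in csv.reader(f, delimiter="\t"):
        value = float(row[1])  # assume column 1 holds a numeric measure
        count += 1
        total += value
        minimum = min(minimum, value)
        maximum = max(maximum, value)

print(count, minimum, maximum, total / count if count else 0.0)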
8. Distributed computing
• An approach based on the distribution of data and the
parallelisation of operations
• Data is replicated over a number of redundant nodes
• Computation is segmented over a number of workers
• to retrieve data from each node
• to perform atomic operations
• to compose the result
10. Apache Hadoop
• Open Source project derived from Google’s MapReduce.
• Uses multiple disks for parallel reads
• Keeps multiple copies of the data for fault tolerance
• Applies MapReduce to split and merge the processing across several workers (a toy sketch follows below)
http://hadoop.apache.org/
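To give a flavour of the programming model, here is a toy, single-process Python imitation of the map/shuffle/reduce phases (not the Hadoop API; in Hadoop each phase runs distributed over workers):

from collections import defaultdict

documents = ["the cat sat", "the cat ran"]  # toy input blocks

# Map phase: each worker independently emits (key, value) pairs.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle phase: group the emitted values by key across all workers.
groups = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        groups[key].append(value)

# Reduce phase: combine each group into a final result.
word_counts = {key: sum(values) for key, values in groups.items()}
print(word_counts)  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}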
12. KMi Big Data Cluster
A private environment for large scale data processing and analytics, based on the Cloudera Open Source distribution:
• HDFS: Hadoop Distributed File System
• Hadoop MapReduce libraries
• HIVE, PIG, HCatalog
• Zookeeper, YARN, …
• HUE workbench
• SPARK, HBase
https://www.cloudera.com/products/open-source.html
13. HUE
• A user interface over most Hadoop tools
• Authentication
• HDFS Browsing
• Data download and upload
• Job monitoring
http://gethue.com/
14. Apache HIVE
• A data warehouse over Hadoop/HDFS
• A query language similar to SQL
• Allows creating SQL-like tables over files or HBase tables
• Naturally views several files as a single table
• HiveQL offers almost all the operators familiar to SQL developers
• Applies MapReduce underneath
https://hive.apache.org/
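As an illustrative sketch only: one common way to run HiveQL from Python is the third-party PyHive client. The host, port and the gutenberg_terms table below are placeholders, not details given in these slides.

from pyhive import hive  # third-party client: pip install pyhive

# Placeholder connection details for your cluster.
connection = hive.connect(host="cluster.example.org", port=10000)
cursor = connection.cursor()

# A familiar SQL-style query; Hive compiles it into MapReduce jobs.
cursor.execute(
    "SELECT doc_id, COUNT(*) AS num_words "
    "FROM gutenberg_terms GROUP BY doc_id"
)
for doc_id, num_words in cursor.fetchall():
    print(doc_id, num_words)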
15. Apache Pig
• Originally developed at Yahoo Research around 2006
• A full-fledged ETL language (Pig Latin); a rough analogy follows this list
• Load/Save data from/to HDFS
• Iterate over data tuples
• Arithmetic operations
• Relational operations
• Filtering, ordering, etc…
• Applies MapReduce underneath
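The workshop exercises write these steps in Pig Latin; as a rough analogy only, this plain-Python sketch mirrors a typical LOAD / FILTER / GROUP / ORDER pipeline (terms.tsv and its layout are invented for the example):

import csv
from collections import Counter

# LOAD: read (doc_id, position, word) tuples from a hypothetical TSV file.
with open("terms.tsv") as f:
    tuples = [tuple(row) for row in csv.reader(f, delimiter="\t")]

# FILTER: keep only words longer than three characters.
filtered = [(doc_id, word) for doc_id, _pos, word in tuples if len(word) > 3]

# GROUP + COUNT: occurrences of each (doc_id, word) pair.
counts = Counter(filtered)

# ORDER: most frequent pairs first.
for (doc_id, word), n in counts.most_common(10):
    print(doc_id, word, n)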
16. Caveat
• Read/write operations to disk are slow and consume resources
• Reading and merging from multiple files is expensive
• Hardware, file system, I/O errors
17. Caveat
• Relational database design principles are NOT recommended,
e.g.:
• Integrity constraints
• De-duplication
• MapReduce is inefficient by definition!
• Bad at managing transactions
• Heavy work even for very simple queries
18. Hands-On!
• Gutenberg project
• Public domain books
• ~50k books in English, ~2 billion words
• Context: build a specialised search engine over the Gutenberg
project
• Task: Compute TF/IDF of these books
http://www.gutenberg.org/
19. Computing TF-IDF
• TF: term frequency
• Sum of term hits adjusted for document length
• tf(t,d) = count(t,d) / len(d)
• e.g. {doc, “cat”, hits=5, len=2000} -> tf = 5/2000 = 0.0025
• IDF: inverse document frequency
• idf(t,D) = log(N / number of documents containing t), where N is the number of all documents in the corpus D
• We can’t do this easily with a laptop… (see the sketch below)
• e.g. Gutenberg English sums to ~1.5 billion terms
https://en.wikipedia.org/wiki/Tf-idf
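A minimal single-machine sketch of both formulas in plain Python (the toy corpus is invented for the example; this is exactly the computation that stops being feasible at ~1.5 billion terms):

import math

documents = {
    "doc1": ["cat", "sat", "mat"],
    "doc2": ["cat", "ran"],
}  # toy corpus of pre-tokenised documents

def tf(term, doc_terms):
    # tf(t,d) = count(t,d) / len(d)
    return doc_terms.count(term) / len(doc_terms)

def idf(term, docs):
    # idf(t,D) = log(N / number of documents containing t)
    n_with_term = sum(1 for terms in docs.values() if term in terms)
    return math.log(len(docs) / n_with_term)

for doc_id, terms in documents.items():
    for term in sorted(set(terms)):
        print(doc_id, term, tf(term, terms) * idf(term, documents))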
20. Step 1/4 - Generate Term Vectors
Natural Language Processing task:
- Remove common words (the, of, for, …)
- Part-of-Speech tagging (Verb, Noun, …)
- Stemming (going -> go)
- Abstract numbers (12, 1.000, 20% -> <NUMBER>)

gutenberg_docs
doc_id | text
Gutenberg-1 | …
Gutenberg-2 | …
Gutenberg-3 | …
…

gutenberg_terms
doc_id | position | word
Gutenberg-1 | 0 | note[VBP]
Gutenberg-1 | 1 | file[NN]
Gutenberg-1 | 2 | combine[VBZ]
…

Look up book Gutenberg-11800 as follows: http://www.gutenberg.org/ebooks/11800
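A minimal sketch of this preprocessing using the NLTK library (the workshop's actual pipeline is not specified in the slides, so treat this as one plausible implementation):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Requires: nltk.download("punkt"), nltk.download("stopwords"),
#           nltk.download("averaged_perceptron_tagger")
common = set(stopwords.words("english"))
stemmer = PorterStemmer()

def term_vector(doc_id, text):
    rows = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(text.lower())):
        if re.fullmatch(r"[\d.,%]+", word):
            term = "<NUMBER>"          # abstract numbers: 12, 1.000, 20%
        elif word in common or not word.isalpha():
            continue                   # remove common words and punctuation
        else:
            term = stemmer.stem(word)  # stemming: going -> go
        rows.append((doc_id, len(rows), f"{term}[{tag}]"))
    return rows

print(term_vector("Gutenberg-1", "Note that 20% of the files combine going fast."))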
21. Step 2/4 Compute Term Frequency (TF)
tf(t,d) = count(t,d) / len(d)

gutenberg_terms
doc_id | position | word
Gutenberg-1 | 0 | note[VBP]
Gutenberg-1 | 1 | file[NN]
Gutenberg-1 | 2 | combine[VBZ]
…
Gutenberg-1 | 5425 | note[VBP]

doc_word_counts: count(t,d)
doc_id | word | num_doc_wrd_usages
Gutenberg-1 | call[VB] | 2
Gutenberg-1 | world[NN] | 22
Gutenberg-1 | combine[VBZ] | 2
…

usage_bag: join with len(d) as doc_size
doc_id | word | num_doc_wrd_usages | doc_size
Gutenberg-1 | call[VB] | 2 | 2377270
Gutenberg-1 | world[NN] | 22 | 2377270
Gutenberg-1 | combine[VBZ] | 2 | 2377270

term_freqs: count(t,d) / len(d), for each term in each doc
doc_id | term | term_freq
Gutenberg-1 | call[VB] | 1.791697274828445E-5
Gutenberg-1 | world[NN] | 1.791697274828445E-5
Gutenberg-1 | combine[VBZ] | 8.958486374142224E-6
…
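The same step in a minimal plain-Python sketch (toy tuples standing in for gutenberg_terms):

from collections import Counter

terms = [  # (doc_id, position, word), as in gutenberg_terms
    ("Gutenberg-1", 0, "note[VBP]"),
    ("Gutenberg-1", 1, "file[NN]"),
    ("Gutenberg-1", 2, "note[VBP]"),
]

# count(t,d): word occurrences per document (doc_word_counts)
doc_word_counts = Counter((doc_id, word) for doc_id, _pos, word in terms)

# len(d): total terms per document (the doc_size column of usage_bag)
doc_size = Counter(doc_id for doc_id, _pos, _word in terms)

# tf(t,d) = count(t,d) / len(d), for each term in each doc (term_freqs)
term_freqs = {
    (doc_id, word): n / doc_size[doc_id]
    for (doc_id, word), n in doc_word_counts.items()
}
print(term_freqs)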
22. Step 3/4 Compute Inverse Document Frequency (IDF)
term_freqs
doc_id | term | term_freq
Gutenberg-1 | call[VB] | 1.791697274828445E-5
Gutenberg-1 | world[NN] | 1.791697274828445E-5
Gutenberg-1 | combine[VBZ] | 8.958486374142224E-6
…

term_usages: count the doc_ids having each term, adding a num_docs_with_term column (e.g. 11234, 5436, 3987)

term_usages_idf: idf = log(48790 / num_docs_with_term), with N = 48790 documents
doc_id | term | term_freq | idf
Gutenberg-5307 | will[MD] | 0.01055794688540567 | 0.09273305662791352
Gutenberg-5307 | must[MD] | 0.0073364195024229134 | 0.0927780327905548
Gutenberg-5307 | good[JJ] | 0.006226481496521292 | 0.11554635054423526
…
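And the IDF step as a minimal plain-Python sketch (toy rows standing in for term_freqs; the second document is invented to make the counting visible):

import math

N = 48790  # total number of documents in the corpus

term_freqs = [  # (doc_id, term, term_freq), as produced in step 2/4
    ("Gutenberg-1", "call[VB]", 1.791697274828445e-05),
    ("Gutenberg-2", "call[VB]", 2.5e-05),
]

# num_docs_with_term: count distinct doc_ids per term (term_usages)
docs_with_term = {}
for doc_id, term, _tf in term_freqs:
    docs_with_term.setdefault(term, set()).add(doc_id)

# idf = log(N / num_docs_with_term), joined back onto term_freqs
for doc_id, term, tf in term_freqs:
    idf = math.log(N / len(docs_with_term[term]))
    print(doc_id, term, tf, idf, tf * idf)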
24. Let’s go …
• Step by step instructions at the following link:
• https://github.com/andremann/DataHub-workshop/tree/master/Working-with-large-tables
25. Summary
• We introduced the notion of distributed computing
• We showed how to process large datasets
• You tasted state-of-the-art tools for data processing using the MK DataHub Hadoop Cluster
• We saw how to compute TF/IDF on a corpus of documents with HIVE and PIG