Working with large tables: Big Data processing and analytics
Enrico Daga - enrico.daga@open.ac.uk - @enridaga
Ilaria Tiddi - ilaria.tiddi@open.ac.uk - @CityLabsProject
Understanding Your Data: From Collection To Effective Analytics
A CityLABS Workshop
12 June 2018 - Knowledge Media Institute, The Open University
Objective

• To introduce the concept of distributed computing
• To show how we use the MK Data Hub Cluster for processing large datasets
• To taste state-of-the-art tools for data processing
• To understand the difference from more traditional approaches (e.g. a relational data warehouse)
Outline

• Tabular data
• Distributed computing
• Hadoop
• Big Data Cluster
• Hue, Hive, PIG
• Hands-On
Tabular data

• Many different types of data objects are tables, or can be translated into and manipulated as data tables
• Excel documents, relational databases → tables
• Text documents → word vectors → tables
• Web data → graph → tables
• JSON → tree → graph → tables
• …
Tables can be large

• Web server logs
  • Thousands of entries each day even for a small web site, billions for a large one
• Social media
  • 500M tweets every day
• Search engines
  • Based on word / document statistics …
  • Google's indexes contain hundreds of billions of documents

Many other cases:
• Stock exchange
• Black boxes
• Power grid
• Transport
• …
Tables can be large

• Most operations on tabular data require scanning all the rows in the table:
  • Filter, Count, MIN, MAX, AVG, …
• One example: computing TF/IDF

"In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus."

https://en.wikipedia.org/wiki/Tf-idf
Distributed computing

• An approach based on the distribution of data and the parallelisation of operations
• Data is replicated over a number of redundant nodes
• Computation is segmented over a number of workers:
  • to retrieve data from each node
  • to perform atomic operations
  • to compose the result
Apache Hadoop

• Open Source project derived from Google's MapReduce
• Uses multiple disks for parallel reads
• Keeps multiple copies of the data for fault tolerance
• Applies MapReduce to split/merge the processing across several workers (see the sketch below)

http://hadoop.apache.org/
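To make the split/merge idea concrete, here is a minimal word-count sketch in Pig Latin (the tool is introduced later in this deck); the input and output paths are hypothetical. Hadoop turns the per-record work into parallel map tasks and the grouping/counting into reduce tasks, without the script having to mention either phase.

  -- Load raw text, one line per record (hypothetical path).
  lines   = LOAD '/data/books' USING TextLoader() AS (line:chararray);
  -- Map side: split every line into individual words.
  words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
  -- Shuffle + reduce side: group identical words and count them.
  grouped = GROUP words BY word;
  counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
  STORE counts INTO '/output/word_counts';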
Apache Hadoop
MK Data Hub Cluster

A private environment for large-scale data processing and analytics.

[Stack diagram: HUE Workbench; HIVE, PIG, SPARK; HCatalog, HBase; Hadoop MapReduce libraries; HDFS (Hadoop Distributed File System); Zookeeper, YARN, …; packaged as Cloudera Open Source.]
HUE

• A user interface over most Hadoop tools
• Authentication
• HDFS browsing
• Data download and upload
• Job monitoring
Apache HIVE

• A data warehouse over Hadoop/HDFS
• A query language similar to SQL
• Allows creating SQL-like tables over files or HBase tables
• Naturally views several files as a single table
• HiveQL has almost all the operators that developers familiar with SQL know
• Applies MapReduce underneath

https://hive.apache.org/
Apache Pig

• Originally developed at Yahoo Research around 2006
• A full-fledged ETL language (Pig Latin; see the sketch below)
• Load/save data from/to HDFS
• Iterate over data tuples
• Arithmetic operations
• Relational operations
• Filtering, ordering, etc.
• Applies MapReduce underneath

https://pig.apache.org/
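A small, self-contained Pig Latin sketch of the operations listed above. The file paths, schema and the web-log example are hypothetical; it only illustrates the load / filter / iterate / group / order / store style of a Pig script.

  -- Load tab-separated records from HDFS (hypothetical path and schema).
  logs = LOAD '/data/web_logs' USING PigStorage('\t')
           AS (ts:chararray, url:chararray, status:int, bytes:long);

  -- Relational / filtering operations.
  errors = FILTER logs BY status >= 500;

  -- Arithmetic inside an iteration over tuples.
  sized  = FOREACH errors GENERATE url, bytes / 1024 AS kbytes;

  -- Grouping, aggregation and ordering.
  by_url = GROUP sized BY url;
  totals = FOREACH by_url GENERATE group AS url, SUM(sized.kbytes) AS total_kb;
  ranked = ORDER totals BY total_kb DESC;

  -- Save the result back to HDFS.
  STORE ranked INTO '/output/error_traffic';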
  
Caveat

• Read / write operations to disk are slow and consume resources
• Reading from and merging multiple files is expensive
• Hardware, file system and I/O errors
Caveat

• Relational database design principles are NOT recommended, e.g.:
  • Integrity constraints
  • De-duplication
• MapReduce is inefficient by definition!
  • Bad at managing transactions
  • Heavy work even for very simple queries
Hands-On

• The Gutenberg project
  • Public domain books
  • ~50k books in English, ~2 billion words
• Context: build a specialised search engine over the Gutenberg project
• Task: compute TF/IDF of these books

http://www.gutenberg.org/
Computing TF-IDF

• TF: term frequency
  • Sum of term hits adjusted for document length
  • tf(t,d) = count(t,d) / len(d)
  • {doc, "cat", hits=5, len=2000} → 0.0025
• IDF: inverse document frequency
  • N = number of documents in the collection (D)
  • divided by the number of documents containing the term
  • on a log scale
• We can't do this easily on a laptop …
  • e.g. Gutenberg English results in ~1.5 billion terms
https://en.wikipedia.org/wiki/Tf-idf
https://en.wikipedia.org/wiki/Zipf%27s_law	
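Putting the two parts together, and following the notation already used above, the quantities computed in the hands-on are:

  tf(t,d)      = count(t,d) / len(d)
  idf(t,D)     = log( N / |{ d in D : t occurs in d }| )
  tfidf(t,d,D) = tf(t,d) * idf(t,D)

Worked example (from the slide): a document of 2000 terms containing "cat" 5 times gives tf = 5 / 2000 = 0.0025. For idf, a term occurring in one document out of every ten gives log(10) ≈ 2.3 (natural log), while a term occurring in every document gives log(1) = 0, so ubiquitous words contribute nothing to the score.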
  
Step 1/4 - Generate Term Vectors

gutenberg_docs:
  doc_id        text
  Gutenberg-1   …
  Gutenberg-2   …
  Gutenberg-3   …
  …

gutenberg_terms:
  doc_id        position   word
  Gutenberg-1   0          note[VBP]
  Gutenberg-1   1          file[NN]
  Gutenberg-1   2          combine[VBZ]
  …

Natural Language Processing task:
- Remove common words (the, of, for, …)
- Part of Speech tagging (Verb, Noun, …)
- Stemming (going → go)
- Abstract (12, 1.000, 20% → <NUMBER>)

Look up book Gutenberg-11800 as follows:
http://www.gutenberg.org/ebooks/11800
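A rough Pig Latin sketch of this step. The input path and schema are hypothetical, and the NLP part (stop-word removal, POS tagging, stemming, number abstraction) would live in a custom UDF that is not shown here; plain TOKENIZE only splits on whitespace.

  -- One book per record: its id and full text (hypothetical path/schema).
  gutenberg_docs = LOAD '/data/gutenberg/docs' USING PigStorage('\t')
                     AS (doc_id:chararray, text:chararray);

  -- Turn each document into (doc_id, word) term vectors.
  -- The workshop pipeline also removes stop words, POS-tags and stems each
  -- token (producing forms such as note[VBP]); that would be a custom UDF.
  gutenberg_terms = FOREACH gutenberg_docs
                    GENERATE doc_id, FLATTEN(TOKENIZE(text)) AS word;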
Step 2/4 - Compute Term Frequency (TF)

tf(t,d) = count(t,d) / len(d), for each term in each doc.

gutenberg_terms:
  doc_id        position   word
  Gutenberg-1   0          note[VBP]
  Gutenberg-1   1          file[NN]
  Gutenberg-1   2          combine[VBZ]
  …
  Gutenberg-1   5425       note[VBP]

doc_word_counts (count(t,d)):
  doc_id        word           num_doc_wrd_usages
  Gutenberg-1   call[VB]       2
  Gutenberg-1   world[NN]      22
  Gutenberg-1   combine[VBZ]   2
  …

usage_bag adds a doc_size column (len(d)), e.g. 2377270 for Gutenberg-1.

term_freqs (count(t,d) / len(d)):
  doc_id        term           term_freq
  Gutenberg-1   call[VB]       1.791697274828445E-5
  Gutenberg-1   world[NN]      1.791697274828445E-5
  Gutenberg-1   combine[VBZ]   8.958486374142224E-6
  …
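A hedged Pig Latin sketch of this step, reusing the relation and column names from the tables above (the exact workshop script may differ):

  -- count(t,d): occurrences of each word in each document.
  grp_doc_word    = GROUP gutenberg_terms BY (doc_id, word);
  doc_word_counts = FOREACH grp_doc_word
                    GENERATE FLATTEN(group) AS (doc_id, word),
                             COUNT(gutenberg_terms) AS num_doc_wrd_usages;

  -- len(d): total number of terms per document, attached to every row.
  grp_doc   = GROUP doc_word_counts BY doc_id;
  usage_bag = FOREACH grp_doc
              GENERATE FLATTEN(doc_word_counts),
                       SUM(doc_word_counts.num_doc_wrd_usages) AS doc_size;

  -- tf(t,d) = count(t,d) / len(d).
  term_freqs = FOREACH usage_bag
               GENERATE doc_word_counts::doc_id AS doc_id,
                        doc_word_counts::word   AS term,
                        (double)doc_word_counts::num_doc_wrd_usages
                          / (double)doc_size    AS term_freq;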
Step 3/4 - Compute Inverse Document Frequency (IDF)

idf(t) = log(48790 / d), where d = num_docs_with_term, for each term in each doc.

term_freqs:
  doc_id        term           term_freq
  Gutenberg-1   call[VB]       1.791697274828445E-5
  Gutenberg-1   world[NN]      1.791697274828445E-5
  Gutenberg-1   combine[VBZ]   8.958486374142224E-6
  …

term_usages adds a num_docs_with_term column (d), e.g. 1234.

term_usages_idf:
  doc_id           term       term_freq               idf
  Gutenberg-5307   will[MD]   0.01055794688540567     0.09273305662791352
  Gutenberg-5307   must[MD]   0.0073364195024229134   0.0927780327905548
  Gutenberg-5307   good[JJ]   0.006226481496521292    0.11554635054423526
  …
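Again as a hedged Pig Latin sketch; 48790 is the corpus size shown on the slide, and LOG is the natural logarithm (the log base used in the actual workshop script is not stated):

  -- d: number of documents in which each term occurs.
  grp_term    = GROUP term_freqs BY term;
  term_usages = FOREACH grp_term
                GENERATE FLATTEN(term_freqs),
                         COUNT(term_freqs) AS num_docs_with_term;

  -- idf(t) = log(N / d), with N = 48790 books in the corpus.
  term_usages_idf = FOREACH term_usages
                    GENERATE term_freqs::doc_id    AS doc_id,
                             term_freqs::term      AS term,
                             term_freqs::term_freq AS term_freq,
                             LOG(48790.0 / (double)num_docs_with_term) AS idf;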
Step 4/4 - Compute TF/IDF

tf_idf = term_freq * idf, for each term in each doc.

term_usages_idf:
  doc_id           term       term_freq               idf
  Gutenberg-5307   will[MD]   0.01055794688540567     0.09273305662791352
  Gutenberg-5307   must[MD]   0.0073364195024229134   0.0927780327905548
  Gutenberg-5307   good[JJ]   0.006226481496521292    0.11554635054423526
  …

tfidf:
  doc_id           term       tf_idf
  Gutenberg-5307   will[MD]   0.09273305662791352
  Gutenberg-5307   must[MD]   0.0927780327905548
  Gutenberg-5307   good[JJ]   0.11554635054423526
  …
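The final step, once more as a hedged Pig Latin sketch (the slide's relation names are kept; the output path is hypothetical):

  -- tf-idf(t,d) = tf(t,d) * idf(t), for every term in every document.
  tfidf = FOREACH term_usages_idf
          GENERATE doc_id, term, term_freq * idf AS tf_idf;

  STORE tfidf INTO '/data/gutenberg/tfidf';

The resulting table is what the specialised search engine from the hands-on brief would query: for a given term, books can be ranked by their tf_idf score.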
Let's go

• Connect to The_Cloud
• https://workshop.bigdata.kmi.org
• HTTPS user: citylabsX
  Password: MiltonKeynesX
  where X is your group number (1, 2, 3, 4 or 5)
• HUE user: citylabs-workshop
  Password: IH31i>kh
  (India Hotel 3 1 india > kilo hotel)

Follow along on the GitHub Workshop page:
https://github.com/andremann/DataHub-workshop/tree/master/Working-with-large-tables
Summary

• We introduced the notion of distributed computing
• We have shown how to process large datasets
• You tasted state-of-the-art tools for data processing using the MK DataHub Hadoop Cluster
• We experienced how to compute TF/IDF on a corpus of documents with HIVE and PIG

Thank you!
