Scalable Learning Technologies for Big Data Mining
These are slides of a tutorial by Gerard de Melo and Aparna Varde presented at the DASFAA 2015 conference.

As data expands into big data, enhanced or entirely novel data mining algorithms often become necessary. The real value of big data is often only exposed when we can adequately mine and learn from it. We provide an overview of new scalable techniques for knowledge discovery. Our focus is on the areas of cloud data mining and machine learning, semi-supervised processing, and deep learning. We also give practical advice for choosing among different methods and discuss open research problems and concerns.
  1. DASFAA 2015 Hanoi Tutorial: Scalable Learning Technologies for Big Data Mining. Gerard de Melo, Tsinghua University, http://gerard.demelo.org; Aparna Varde, Montclair State University, http://www.montclair.edu/~vardea/
  2. Big Data. Alibaba: 31 million orders per day! (2014). Images: Caixin, Corbis
  3. Big Data on the Web. Source: Coup Media 2013
  4. Big Data on the Web. Source: Coup Media 2013
  5. From Big Data to Knowledge. Image: Brett Ryder
  6. Learning from Data. Previous Knowledge, Preparation Time, Passed Exam?: Student 1: 80%, 48h, Yes; Student 2: 50%, 75h, Yes; Student 3: 95%, 24h, Yes; Student 4: 60%, 24h, No; Student 5: 80%, 10h, No
  7. Learning from Data. The same table, now with an unseen case to predict: Student A: 30%, 20h, ?
  8. Learning from Data. The same table with two unseen cases: Student A: 30%, 20h, ?; Student B: 80%, 45h, ?
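As a concrete illustration of the learning setup in slides 6 to 8, the following minimal sketch fits a classifier on the five students above and predicts the two unseen students; scikit-learn and the feature encoding are my own choices here, not part of the tutorial.

    from sklearn.linear_model import LogisticRegression

    # Features: [previous knowledge as a fraction, preparation time in hours]
    X_train = [[0.80, 48], [0.50, 75], [0.95, 24], [0.60, 24], [0.80, 10]]
    y_train = ["Yes", "Yes", "Yes", "No", "No"]  # Passed exam?

    model = LogisticRegression().fit(X_train, y_train)

    # Students A (30%, 20h) and B (80%, 45h) from slides 7 and 8
    print(model.predict([[0.30, 20], [0.80, 45]]))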
  9. Machine Learning (diagram). Data with or without labels (e.g. D1: 0.324, 0.739, 0.000, 0.112) is fed into unsupervised or supervised learning, which yields a classifier/model; the model then produces predictions, i.e. labels for test data (e.g. "Probably Spam!").
  10. Data Mining (diagram). The wider pipeline: world → data acquisition → raw data (0100101...) → preprocessing + feature engineering → data with or without labels (e.g. D1: 0.324, 0.739, 0.000, 0.112) → unsupervised or supervised learning/analysis → classifier/model and analysis results → prediction, labels for test data, visualization, and use of the new knowledge.
  11. Problem with Classic Methods: Scalability. Image: http://www.whistlerisawesome.com/wp-content/uploads/2011/12/drinking-from-firehose.jpg
  12. Scaling Up
  13. Scaling Up: More Features. Previous Knowledge, Preparation Time, Passed Exam?: Student 1: 80%, 48h, Yes; Student 2: 50%, 75h, Yes; Student 3: 95%, 24h, Yes; Student 4: 60%, 24h, No; Student 5: 80%, 10h, No
  14. Scaling Up: More Features. The same table extended with many additional columns, for example: words and phrases mentioned in the exam response, Facebook likes, websites visited, user interaction details in online learning. Could be many millions of features!
  15. Scaling Up: More Features. Classic solution: feature selection.
  16. Scaling Up: More Features. Scalable solution: buckets with sums of the original features (e.g. "Clicked on http://physics...", "Clicked on http://icsi.berkeley...").
  17. Scaling Up: More Features. Feature hashing: use a fixed feature dimensionality n; hash the original feature ID (e.g. "Clicked on http://...") to a bucket number in 0 to n-1; normalize the features and use bucket-wise sums. The small loss of precision is usually trumped by the big gains from being able to use more features.
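A minimal Python sketch of the feature-hashing recipe on slide 17; the hash function, the dimensionality n, and the example feature IDs are illustrative assumptions (libraries such as scikit-learn's FeatureHasher or Vowpal Wabbit provide production versions of this idea).

    import hashlib

    def hash_features(raw_features, n=2**18):
        """Map arbitrary feature IDs (e.g. 'Clicked on http://...') to n buckets."""
        x = [0.0] * n
        for feature_id, value in raw_features.items():
            bucket = int(hashlib.md5(feature_id.encode()).hexdigest(), 16) % n
            x[bucket] += value  # bucket-wise sum of the original feature values
        return x

    x = hash_features({"Clicked on http://physics...": 1.0,
                       "PreparationTime=48h": 1.0})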
  18. Scaling Up: More Training Examples. Banko & Brill (2001): word confusion experiments (e.g. "principal" vs. "principle")
  19. Scaling Up: More Training Examples. Banko & Brill (2001): word confusion experiments (e.g. "principal" vs. "principle"). More data often trumps better algorithms: Alon Halevy, Peter Norvig, Fernando Pereira (2009), The Unreasonable Effectiveness of Data.
  20. Scaling Up: More Training Examples. Léon Bottou, Learning with Large Datasets tutorial: text classification experiments.
  21. Background: Stochastic Gradient Descent. Move towards the optimum by approximating the gradient based on one (or a small batch of) random training examples. The stochastic nature may help us escape local optima. Improved variants: AdaGrad (Duchi et al. 2011), AdaDelta (Zeiler et al. 2012). Images: http://en.wikipedia.org/wiki/File:Hill_climb.png
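A bare-bones sketch of stochastic gradient descent as described on slide 21, for a linear model with squared loss; the learning rate and epoch count are arbitrary, and variants such as AdaGrad or AdaDelta would adapt the step size per dimension.

    import random

    def sgd(examples, dim, lr=0.01, epochs=10):
        """examples: list of (feature_vector, target) pairs; squared loss."""
        w = [0.0] * dim
        for _ in range(epochs):
            random.shuffle(examples)
            for x, y in examples:  # approximate the gradient from one example
                pred = sum(wi * xi for wi, xi in zip(w, x))
                w = [wi - lr * (pred - y) * xi for wi, xi in zip(w, x)]
        return w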
  22. Scaling Up: More Training Examples. Recommended tool: Vowpal Wabbit by John Langford et al., http://hunch.net/~vw/
  23. Scaling Up: More Training Examples. Parallelization? Use a lock-free approach to updating the weight vector components (Hogwild! by Niu, Recht, et al.).
  24. Problem: Where to get Training Examples? Labeled data is expensive! Penn Chinese Treebank: 2 years for 4,000 sentences. Adaptation is difficult: Wall Street Journal ≠ novels ≠ Twitter. For speech recognition, one ideally needs training data for each domain, voice/accent, microphone, microphone setup, social setting, etc. Image: http://en.wikipedia.org/wiki/File:Chronic_fatigue_syndrome.JPG
  25. Semi-Supervised Learning
  26. Semi-Supervised Learning. Goal: when learning a model, use unlabeled data in addition to labeled data. Example: cluster-and-label: run a clustering algorithm on the labeled and unlabeled data, then assign the cluster-majority label to the unlabeled examples of every cluster. Image: Wikipedia
  27. Semi-Supervised Learning. Bootstrapping or self-training (e.g. Yarowsky 1995): use the classifier to label the unlabelled examples, add the labels with the highest confidence to the training data and re-train; repeat.
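The bootstrapping/self-training loop from slide 27 can be sketched as follows; the scikit-learn-style classifier interface, the confidence threshold, and the number of rounds are assumptions chosen for illustration.

    def self_train(clf, X, y, unlabeled, threshold=0.95, rounds=5):
        """clf: any classifier with fit/predict_proba (scikit-learn style)."""
        X, y, pool = list(X), list(y), list(unlabeled)
        for _ in range(rounds):
            clf.fit(X, y)
            if not pool:
                break
            probs = clf.predict_proba(pool)
            keep = []
            for xi, pi in zip(pool, probs):
                if max(pi) >= threshold:          # keep only high-confidence labels
                    X.append(xi)
                    y.append(clf.classes_[pi.argmax()])
                else:
                    keep.append(xi)
            pool = keep                           # re-train on the enlarged set
        return clf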
  28. Semi-Supervised Learning. Co-training (Blum & Mitchell 1998). Given: multiple (ideally independent) views of the same data (e.g. the left context and the right context of a word). Learn separate models for each view and let the different views teach each other: model 1 can generate labels that will be helpful to improve model 2, and vice versa.
  29. Semi-Supervised Learning: Transductive Setting. Image: Partha Pratim Talukdar
  30. Semi-Supervised Learning: Transductive Setting. Algorithms: Label Propagation (Zhu et al. 2003), Adsorption (Baluja et al. 2008), Modified Adsorption (Talukdar et al. 2009). Image: Partha Pratim Talukdar
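The transductive algorithms listed on slide 30 share a simple core; below is a minimal NumPy sketch of iterative label propagation in the spirit of Zhu et al. (2003), where the affinity matrix W and the clamping scheme are simplified assumptions rather than the exact published algorithm.

    import numpy as np

    def label_propagation(W, Y_seed, n_iter=100):
        """W: (n, n) symmetric affinity matrix; Y_seed: (n, k) one-hot rows for
        labeled nodes and all-zero rows for unlabeled nodes."""
        P = W / W.sum(axis=1, keepdims=True)   # row-normalized transition matrix
        labeled = Y_seed.sum(axis=1) > 0
        Y = Y_seed.astype(float).copy()
        for _ in range(n_iter):
            Y = P @ Y                          # spread labels along graph edges
            Y[labeled] = Y_seed[labeled]       # clamp the known (seed) labels
        return Y.argmax(axis=1)                # predicted class per node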
  31. Distant Supervision. Sentiment analysis: look for Twitter tweets with emoticons like ":)" and ":(", remove the emoticons, then use the tweets as training data! (Crimson Hexagon)
  32. Representation Learning to Better Exploit Big Data
  33. Representations. Image: David Warde-Farley, via Bengio et al., Deep Learning book
  34. Representations. Inputs as bits: 0011001… Note the sharing between classes. Images: Marc'Aurelio Ranzato
  35. Representations. Inputs as bits: 0011001… Massive improvements in image object recognition (human-level?) and speech recognition; good improvements in NLP and IR-related tasks. Images: Marc'Aurelio Ranzato
  36. Example: Google's image. Source: Jeff Dean, Google
  37. Inspiration: The Brain. Input: delivered via dendrites from other neurons. Processing: synapses may alter the input signals; the cell then combines all input signals. Output: if there is enough activation from the inputs, an output signal is sent through a long cable (the "axon"). Source: Alex Smola
  38. Perceptron. Input: features. Every feature fi gets a weight wi (example weights: dog 7.2, food 3.4, bank -7.3, delicious 1.5, train -4.2). Diagram: features f1, ..., f4 feed the neuron with weights w1, ..., w4.
  39. Perceptron. Activation of the neuron: multiply the feature values of an object x with the feature weights: a(x) = Σ_i w_i f_i(x) = wᵀ f(x).
  40. Perceptron. Output of the neuron: check whether the activation exceeds a threshold t = -b, i.e. output(x) = g(wᵀ f(x) + b), where g could, for example, return 1 if the activation is positive and -1 otherwise (e.g. 1 for "spam", -1 for "not spam").
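Slides 38 to 40 describe the perceptron unit; here is a short sketch of the activation, the thresholded output, and the classic mistake-driven training rule (the update rule itself is standard but not spelled out on the slides).

    def output(w, b, f):
        a = sum(wi * fi for wi, fi in zip(w, f)) + b   # activation w^T f(x) + b
        return 1 if a > 0 else -1                      # e.g. 1 = spam, -1 = not spam

    def train_perceptron(examples, dim, epochs=10):
        """examples: list of (feature_vector, label) pairs with labels in {+1, -1}."""
        w, b = [0.0] * dim, 0.0
        for _ in range(epochs):
            for f, y in examples:
                if output(w, b, f) != y:               # update only on mistakes
                    w = [wi + y * fi for wi, fi in zip(w, f)]
                    b += y
        return w, b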
  41. Decision Surfaces. Decision Trees; Linear Classifiers (Perceptron, SVM); Kernel-based Classifiers (Kernel Perceptron, Kernel SVM); Multi-Layer Perceptron. Linear classifiers give only a straight decision surface (and the plain Perceptron is not max-margin), whereas decision trees, kernel-based classifiers and multi-layer perceptrons can produce any decision surface. Images: Vibhav Gogate
  42. Deep Learning: Multi-Layer Perceptron. Diagram: an input layer with features f1, ..., f4, a hidden layer (neurons 1 and 2), and an output layer.
  43. Deep Learning: Multi-Layer Perceptron. Diagram continued: input layer, hidden layer, output layer.
  44. Deep Learning: Multi-Layer Perceptron. Diagram continued: two output neurons.
  45. Deep Learning: Multi-Layer Perceptron. Single layer: output(x) = g(W f(x) + b), with f(x) coming from the input layer (feature extraction). Three-layer network: output(x) = g2(W2 g1(W1 f(x) + b1) + b2). Four-layer network: output(x) = g3(W3 g2(W2 g1(W1 f(x) + b1) + b2) + b3).
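The layered formulas on slide 45 translate directly into a forward pass; here is a small NumPy sketch in which the layer sizes, random weights, and activation functions are placeholder assumptions.

    import numpy as np

    def forward(f_x, layers):
        """layers: list of (W, b, g) triples, so that a three-layer network is
        output(x) = g2(W2 @ g1(W1 @ f(x) + b1) + b2)."""
        h = np.asarray(f_x, dtype=float)
        for W, b, g in layers:
            h = g(W @ h + b)
        return h

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    layers = [(np.random.randn(3, 4), np.zeros(3), np.tanh),   # hidden layer
              (np.random.randn(2, 3), np.zeros(2), sigmoid)]   # output layer
    print(forward([0.1, 0.5, -0.2, 1.0], layers))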
  46. Deep Learning: Multi-Layer Perceptron.
  47. Deep Learning: Computing the Output. Simply evaluate the output function: for each node, compute an output based on the node's inputs. Diagram: inputs x1, x2; hidden nodes z1, z2, z3; outputs y1, y2.
  48. Deep Learning: Training. Compute the error on the output; if it is non-zero, do a stochastic gradient step on the error function to fix it. Backpropagation: the error is propagated back from the output nodes towards the input layer.
  49. Deep Learning: Training. Exploit the chain rule to compute the gradient: for x, y = f(x), z = g(y), we have ∂z/∂x = (∂z/∂y)(∂y/∂x). We are interested in the gradient, i.e. the partial derivatives of the output function with respect to all inputs and weights, including those at a deeper part of the network. Backpropagation: the error is propagated back from the output nodes towards the input layer.
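To make the chain-rule argument on slides 48 and 49 concrete, here is one stochastic gradient step for a two-layer network with a tanh hidden layer and squared error; the architecture and loss function are assumptions chosen for brevity.

    import numpy as np

    def sgd_step(x, y, W1, b1, W2, b2, lr=0.1):
        h = np.tanh(W1 @ x + b1)             # forward pass
        y_hat = W2 @ h + b2
        d_out = y_hat - y                    # dE/dy_hat for E = 0.5*||y_hat - y||^2
        d_h = (W2.T @ d_out) * (1 - h**2)    # error propagated back through tanh
        W2 -= lr * np.outer(d_out, h); b2 -= lr * d_out
        W1 -= lr * np.outer(d_h, x);   b1 -= lr * d_h
        return W1, b1, W2, b2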
  50. DropOut Technique. Basic idea: while training, randomly drop inputs (set the feature to zero). Effect: training on variations of the original training data (an artificial increase of the training data size); the trained network relies less on the existence of specific features. Reference: Hinton et al. (2012); see also Maxout Networks, Goodfellow et al. (2013).
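A minimal sketch of the dropout idea from slide 50; the drop probability of 0.5 follows Hinton et al. (2012), while the "inverted" scaling at training time is one common implementation choice, not something the slide prescribes.

    import numpy as np

    def dropout(h, p_drop=0.5, training=True):
        """Randomly zero activations while training; scale the survivors so that
        no rescaling is needed at test time."""
        if not training:
            return h
        mask = np.random.rand(*h.shape) >= p_drop
        return h * mask / (1.0 - p_drop)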
  51. Deep Learning: Convolutional Neural Networks. Image: http://torch.cogbits.com/doc/tutorials_supervised/. Reference: Yann LeCun's work.
  52. Deep Learning: Recurrent Neural Networks. Source: Bayesian Behavior Lab, Northwestern University
  53. Deep Learning: Recurrent Neural Networks. One can then do backpropagation. Challenge: vanishing/exploding gradients. Source: Bayesian Behavior Lab, Northwestern University
  54. Deep Learning: Long Short Term Memory Networks. Source: Bayesian Behavior Lab, Northwestern University
  55. Deep Learning: Long Short Term Memory Networks. Deep LSTMs for sequence-to-sequence learning, Sutskever et al. 2014 (Google).
  56. Deep Learning: Long Short Term Memory Networks. French original: "La dispute fait rage entre les grands constructeurs aéronautiques à propos de la largeur des sièges de la classe touriste sur les vols long-courriers, ouvrant la voie à une confrontation amère lors du salon aéronautique de Dubaï qui a lieu ce mois-ci." LSTM's English translation: "The dispute is raging between large aircraft manufacturers on the size of the tourist seats on the long-haul flights, leading to a bitter confrontation at the Dubai Airshow in the month of October." Ground truth English translation: "A row has flared up between leading plane makers over the width of tourist-class seats on long-distance flights, setting the tone for a bitter confrontation at this month's Dubai Airshow." Sutskever et al. 2014 (Google)
  57. Deep Learning: Neural Turing Machines. Source: Bayesian Behavior Lab, Northwestern University
  58. Deep Learning: Neural Turing Machines. Source: Bayesian Behavior Lab, Northwestern University
  59. Deep Learning: Neural Turing Machines. Source: Bayesian Behavior Lab, Northwestern University
  60. Deep Learning: Neural Turing Machines. Source: Bayesian Behavior Lab, Northwestern University
  61. Deep Learning: Neural Turing Machines. Source: Bayesian Behavior Lab, Northwestern University
  62. Deep Learning: Neural Turing Machines. Learning to sort! The vectors for the numbers are random.
  63. Big Data in Feature Engineering and Representation Learning
  64. Web Semantics: Statistics from Big Data as Features. Language models for autocompletion.
  65. Word Segmentation. Source: Wang et al., An Overview of Microsoft Web N-gram Corpus and Applications
  66. Parsing: Ambiguity. NP coordination. Source: Bansal & Klein (2011)
  67. Parsing: Web Semantics. Source: Bansal & Klein (2011)
  68. Adjective Ordering. Lapata & Keller (2004): the Web as a baseline (also Bergsma et al. 2010). "big fat Greek wedding" but not "fat Greek big wedding". Source: Shane Bergsma
  69. Coreference Resolution. Source: Bansal & Klein 2012
  70. Coreference Resolution. Source: Bansal & Klein 2012
  71. Distributional Semantics. Data sparsity: most words are rare (in the "long tail") and hence missing from the training data. Solution (Blitzer et al. 2006, Koo & Collins 2008, Huang & Yates 2009, etc.): cluster together similar features, and use the clustered features instead of, or in addition to, the original features. Brown Corpus. Source: Baroni & Evert
  72. Spelling Correction. Even worse: Arnold Schwarzenegger
  73. Vector Representations. Put words into a vector space (e.g. with d = 300 dimensions); in the sketch, "petronia" and "sparrow" lie near "bird", while "parched" and "arid" lie near "dry".
  74. Word Vector Representations. Tomas Mikolov et al., Proc. ICLR 2013. Available from https://code.google.com/p/word2vec/
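Slide 74 points to the original word2vec code; an analogous sketch using the gensim reimplementation is shown below. The toy corpus is an assumption, and the parameter name vector_size follows recent gensim releases (older releases call it size).

    from gensim.models import Word2Vec   # gensim reimplementation, not the C tool

    sentences = [["big", "data", "mining"],
                 ["deep", "learning", "for", "big", "data"]]
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

    vec = model.wv["data"]                # the learned 100-dimensional word vector
    print(model.wv.most_similar("data"))  # nearest neighbours in the vector space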
  75. Wikipedia
  76. Text Simplification. Exploit the edit history, especially on Simple English Wikipedia, e.g. "collaborate" → "work together", "stands for" → "is the same as".
  77. Answering Questions. IBM's Jeopardy!-winning Watson system.
  78. Knowledge Integration
  79. UWN/MENTA: a multilingual extension of WordNet with word senses and taxonomical information for over 200 languages. www.lexvo.org/uwn/
  80. WebChild: Common-Sense Knowledge (AAAI 2014, WSDM 2014, AAAI 2011)
  81. Challenge: From Really Big Data to Real Insights. Image: Brett Ryder
  82. Big Data Mining in Practice
  83. Gerard de Melo (Tsinghua University, Beijing, China), Aparna Varde (Montclair State University, NJ, USA). DASFAA, Hanoi, Vietnam, April 2015
  84. Dr. Aparna Varde
  85. Internet-based computing: shared resources, software & data provided on demand, like the electricity grid. Follows a pay-as-you-go model.
  86. Several technologies, e.g., MapReduce & Hadoop. MapReduce: data-parallel programming model for clusters of commodity machines; pioneered by Google; processes 20 PB of data per day. Hadoop: open-source framework for distributed storage and processing of very large data sets; HDFS (Hadoop Distributed File System) for storage, MapReduce for processing; developed by Apache.
  87. Scalability: to large data volumes; scanning 100 TB on 1 node @ 50 MB/s = 24 days, scanning on a 1000-node cluster = 35 minutes. Cost-efficiency: commodity nodes (cheap, but unreliable), commodity network (low bandwidth), automatic fault-tolerance (fewer admins), easy to use (fewer programmers).
  88. Data type: key-value records. Map function: (Kin, Vin) → list(Kinter, Vinter). Reduce function: (Kinter, list(Vinter)) → list(Kout, Vout).
  89. 89. MapReduce Example the quick brown fox the fox ate the mouse how now brown cow Map Map Map Reduce Reduce brown, 2 fox, 2 how, 1 now, 1 the, 3 ate, 1 cow, 1 mouse, 1 quick, 1 the, 1 brown, 1 fox, 1 quick, 1 the, 1 fox, 1 the, 1 how, 1 now, 1 brown, 1 ate, 1 mouse, 1 cow, 1 Input Map Shuffle & Sort Reduce Output 7
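The word-count example above can be written as a pair of small Python scripts in the Hadoop Streaming style (reading from stdin, writing tab-separated key-value pairs); the script names are illustrative, and locally the same pipeline can be simulated by piping the mapper output through a sort before the reducer, which mimics the shuffle-and-sort phase.

    # mapper.py: emit (word, 1) for every word read from stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py: sum the counts per word (input arrives sorted by key)
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(current + "\t" + str(total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(current + "\t" + str(total))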
  90. 90.  40 nodes/rack, 1000-4000 nodes in cluster  1 Gbps bandwidth in rack, 8 Gbps out of rack  Node specs (Facebook): 8-16 cores, 32 GB RAM, 8×1.5 TB disks, no RAID Aggregation switch Rack switch 8
  91. 91.  Files split into 128MB blocks  Blocks replicated across several data nodes (often 3)  Name node stores metadata (file names, locations, etc)  Optimized for large files, sequential reads  Files are append-only Namenode Datanodes 1 2 3 4 1 2 4 2 1 3 1 4 3 3 2 4 File1 9
  92. 92.  Hive: Relational D/B on Hadoop developed at Facebook  Provides SQL-like query language 10
  93. 93. Supports table partitioning, complex data types, sampling, some query optimization These help discover knowledge by various tasks, e.g., • Search for relevant terms • Operations such as word count • Aggregates like MIN, AVG 11
  94. /* Find documents of the enron table with word frequencies within the range of 75 to 80 */ SELECT DISTINCT D.DocID FROM docword_enron D WHERE D.count > 75 AND D.count < 80 LIMIT 10; OK 1853 … 11578 16653 Time taken: 64.788 seconds
  95. /* Create a view to find the count for WordID=90 and DocID=40, for the nips table */ CREATE VIEW Word_Freq AS SELECT D.DocID, D.WordID, V.word, D.count FROM docword_Nips D JOIN vocabNips V ON D.WordID=V.WordID AND D.DocId=40 AND D.WordId=90; OK Time taken: 1.244 seconds
  96. /* Find documents which use the word "rational" from the nips table */ SELECT D.DocID, V.word FROM docword_Nips D JOIN vocabnips V ON D.wordID=V.wordID AND V.word="rational" LIMIT 10; OK 434 rational 275 rational 158 rational … 290 rational 422 rational Time taken: 98.706 seconds
  97. /* Find the average frequency of all words in the enron table */ SELECT AVG(count) FROM docWord_enron; OK 1.728152608060543 Time taken: 68.2 seconds
  98. 98. Query Execution Time for HQL & MySQL on big data sets Similar claims for other SQL packages 16
  99. 99. 17 Server Storage Capacity Max Storage per instance
  100. 100. 18
  101. Hive supports rich data types (Map, Array & Struct; complex types) and queries with SQL filters, joins, Group By, Order By, etc. Here is where (original) Hive users miss SQL: no support in Hive to update data after insert; no (or little) support in Hive for relational semantics (e.g., ACID); no "delete from" command in Hive (only bulk delete is possible); no concept of primary and foreign keys in Hive.
  102. 102.  Ensure dataset is already compliant with integrity constraints before load  Ensure that only compliant data rows are loaded SELECT & temporary staging table  Check for referential constraints using Equi- Join and query on those rows that comply 20
  103. 103. Providing more power than Hive, Hadoop & MR 21
  104. 104. Cloudera’s Impala: More efficient SQL- compliant analytic database Hortonworks’ Stinger: Driving the future of Hive with enterprise SQL at Hadoop scale Apache’s Mahout: Machine Learning Algorithms for Big Data Spark: Lightning fast framework for Big Data MLlib: Machine Learning Library of Spark MLbase: Platform Base for MLlib in Spark 22
  105. 105.  Fully integrated, state-of-the-art analytic D/B to leverage the flexibility & scalability of Hadoop  Combines benefits • Hadoop: flexibility, scalability, cost-effectiveness • SQL: performance, usability, semantics 23
  106. 106. MPP: Massively Parallel Processing
  107. 107.  NDV: function for counting • Table w/ 1 billion rows  COUNT(DISTINCT) • precise answer • slow for large-scale data  NDV() function • approximate result • much faster 25
  108. 108. Hardware Configuration • Generates less CPU load than Hive • Typical performance gains: 3x-4x • Impala cannot go faster than hardware permits! Query Complexity • Single-table aggregation queries : less gain • Queries with at least one join: gains of 7-45X Main Memory as Cache • Data accessed by query is in cache, speedup is more • Typical gains with cache: 20x-90x 26
  109. 109.  No - Many viable use cases for MR & Hive • Long-running data transformation workloads & traditional DW frameworks • Complex analytics on limited, structured data sets  Impala is a complement to the approaches • Supports cases with very large data sets • Especially to get focused result sets quickly 27
  110. 110. Drive future of Hive with enterprise SQL at Hadoop scale 3 main objectives • Speed: Sub-second query response times • Scale: From GB to TB & PB • SQL: Transactions & SQL:2011 analytics for Hive 28
  111. 111.  Wider use cases with modifications to data  BEGIN, COMMIT, ROLLBACK for multi-stmt transactions 29
  112. 112.  Hybrid engine with LLAP (Live Long and Process) • Caching & data reuse across queries • Multi-threaded execution • High throughput I/O • Granular column level security 30
  113. 113. Common Table Expressions Sub-queries: correlated & uncorrelated Rollup, Cube, Standard Aggregates Inner, Outer, Semi & Cross Joins Non Equi-Joins Set Functions: Union, Except & Intersect Most sub-queries, nested and otherwise 31
  114. 114. 32
  115. 115. Supervised and Unsupervised Learning Algorithms 33
  116. 116. ML algorithms on distributed frameworks good for mining big data on the cloud Supervised Learning: e.g. Classification Unsupervised Learning: e.g. Clustering The word Mahout means “elephant rider” in Hindi (from India), an interesting analogy  34
  117. 117.  Clustering (Unsupervised) • K-means, Fuzzy k-means, Streaming k-means etc.  Classification (Supervised) • Random Forest, Naïve Bayes etc.  Collaborative Filtering (Semi-Supervised) • Item Based, Matrix Factorization etc.  Dimensionality Reduction (For Learning) • SVD, PCA etc.  Others • LDA for Topic Models, Sparse TF-IDF Vectors from Text etc. 35
  118. Input: big data from emails. Goal: automatically classify text into various categories. Prior to classifying the text, TF-IDF (Term Frequency - Inverse Document Frequency) is applied; TF-IDF increases with the frequency of a word in a document, offset by the frequency of the word in the corpus. Naïve Bayes is used for classification: a simple classifier using posterior probability, assuming each attribute is distributed independently of the others.
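The slides implement this pipeline with Apache Mahout on HDFS; as a self-contained analogue of the same two steps (TF-IDF vectorization followed by a Naïve Bayes classifier), here is a scikit-learn sketch with made-up toy documents. It is not the Mahout-based setup used in the tutorial.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_docs = ["create an external hive table and query it",
                  "train a naive bayes model with mahout on hdfs",
                  "lunch plans for friday?"]
    train_labels = ["Hive", "Mahout", "Other"]

    vec = TfidfVectorizer()                     # TF-IDF weighted sparse vectors
    X = vec.fit_transform(train_docs)
    clf = MultinomialNB().fit(X, train_labels)  # posterior-probability classifier

    print(clf.predict(vec.transform(["select docid from a hive view"])))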
  119. 119. Training Data Pre- Process Training Algorithm Model Building the Model Historical data with reference decisions: • Collected a set of e- mails organized in directories labeled with predictor categories: • Mahout • Hive • Other • Stored email as text in HDFS on an EC2 virtual server Using Apache Mahout: • Convert text files to HDFS Sequential File format • Create TF-IDF weighted Sparse Vectors Build and evaluate the model with Mahout’s implementation of Naïve Bayes Classifier Classification Model which takes as input vectorized text documents and assigns one of three document topics: • Mahout • Hive • Other 37
  120. 120. New Data Model Output Using Model to Classify New Data • Store a set of new email documents as text in HDFS on an EC2 virtual server • Pre-process using Apache Mahout’s Libraries Use the existing model to predict topics for new text files. This was implemented in Java • With Apache Mahout Libraries and • Apache Maven to manage dependencies and build the project • The executable JAR file is submitted with the project documentation • The program works with data files stored in HDFS For each input document, the model returns one of the following categories: • Mahout • Hive • Other 38
  121. 121. Text Classification - Results Predicted Category: Hive Input: Output : Possible uses • Automatic email sorting • Automatic news classification • Topic modeling 39
  122. 122.  Lightning fast processing for Big Data  Open-source cluster computing developed in AMPLab at UC Berkeley  Advanced DAG execution engine for cyclic data flow & in-memory computing  Very well-suited for large-scale Machine Learning 40
  123. 123. Spark runs much faster than Hadoop & MR 100x faster in memory 10x faster on disk
  124. 124. 42
  125. 125.  Uses TimSort (derived from merge-sort & insertion-sort), faster than quick-sort  Exploits cache locality due to in-memory computing  Fault-tolerant when scaling, well-designed for failure-recovery  Deploys power of cloud through enhanced N/W & I/O intensive throughput 43
  126. 126.  Spark Core: Distributed task dispatching, scheduling, basic I/O  Spark SQL: Has SchemaRDD (Resilient Distributed Databases); SQL support with CLI, ODBC/JDBC  Spark Streaming: Uses Spark’s fast scheduling for stream analytics, code written for batch analytics can be used for streams  MLlib: Distributed machine learning framework (10x of Mahout)  Graph X: Distributed graph processing framework with API 44
  127. 127.  MLBase - tools & interfaces to bridge gap b/w operational & investigative ML  Platform to support MLlib - the distributed Machine Learning Library on top of Spark 45
  128. 128.  MLlib: Distributed ML library for classification, regression, clustering & collaborative filtering  MLI: API for feature extraction, algorithm development, high-level ML programming abstractions  ML Optimizer: Simplifies ML problems for end users by automating model selection 46
  129. 129.  Classification • Support Vector Machines (SVM), Naive Bayes, decision trees  Regression • linear regression, regression trees  Collaborative Filtering • Alternating Least Squares (ALS)  Clustering • k-means  Optimization • Stochastic gradient descent (SGD), Limited-memory BFGS  Dimensionality Reduction • Singular value decomposition (SVD), principal component analysis (PCA) 47
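As one concrete entry point to the MLlib algorithms listed above, here is a PySpark sketch of k-means clustering; it follows the pyspark.mllib API of Spark 1.x as shown in the official examples, with toy in-memory data standing in for a large HDFS dataset.

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="kmeans-sketch")
    # toy 2-D points; a real job would load a large dataset, e.g. from HDFS
    points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

    model = KMeans.train(points, k=2, maxIterations=10)
    print(model.clusterCenters)          # the two learned cluster centres
    print(model.predict([0.5, 0.5]))     # cluster index for a new point
    sc.stop()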
  130. 130. Easily interpretable Ensembles are top performers Support for categorical variables Can handle missing data Distributed decision trees Scale well to massive datasets 48
  131. 131. Uses modified version of k-means Feature extraction & selection • Extraction: lot of time and tools • Selection: domain expertise • Wrong selection of features: bad quality clusters Glassbeam’s SCALAR platform • SPL (Semiotic Parsing Language) • Makes feature engineering easier & faster 49
  132. 132. 50
  133. 133.  High-D data: Not all features IMP to build model & answer Qs  Many applications: Reduce dimensions before building model  MLlib: 2 algorithms for dimensionality reduction • Principal Component Analysis (PCA) • Singular Value Decomposition (SVD) 51
  134. 134. 52
  135. 135. 53
  136. 136.  Grow into unified platform for data scientists  Reduce time to market with platforms like Glassbeam’s SCALAR for feature engineering  Include more ML algorithms  Introduce enhanced filters  Improve visualization for better performance 54
  137. 137. Processing Streaming Data on the Cloud 55
  138. 138. Apache Storm • Reliably process unbounded streams, do for real-time what Hadoop did for batch processing Apache Flink • Fast & reliable large scale data processing engine with batch & stream based alternatives 56
  139. 139.  Integrates queueing & D/B  Nimbus node • Upload computations • Distribute code on cluster • Launch workers on cluster • Monitor computation  ZooKeeper nodes • Coordinate the Storm cluster  Supervisor nodes • Interacts w/ Nimbus through Zookeeper • Starts & stops workers w/ signals from Nimbus 57
  140. 140.  Tuple: ordered list of elements, e.g., a “4-tuple” (7, 1, 3, 7)  Stream: unbounded sequence of tuples  Spout: source of streams in a computation (e.g.Twitter API)  Bolt: process I/P streams & produce O/P streams to run functions  Topology: overall calculation, as N/W of spouts and bolts 58
  141. 141.  Exploits in-memory data streaming & adds iterative processing into system  Makes system super fast for data- intensive & iterative jobs 59 Performance Comparison
  142. 142.  Requires few config parameters  Built-in optimizer finds best way to run program  Supports all Hadoop I/O & data types  Runs MR operators unmodified & faster  Reads data from HDFS 60 Execution of Flink
  143. 143. Summary and Ongoing Work 61
  144. 144.  MapReduce & Hadoop: Pioneering tech  Hive: SQL-like, good for querying  Impala: Complementary to Hive, overcomes its drawbacks  Stinger: Drives future of Hive w/ advanced SQL semantics  Mahout: ML algorithms (Sup & Unsup)  Spark: Framework more efficient & scalable than Hadoop  MLlib: Machine Learning Library of Spark  MLbase: Platform supporting MLlib  Storm: Stream processing for cloud & big data  Flink: Both stream & batch processing 62
  145. 145.  Store & process Big Data? • MR & Hadoop - Classical technologies • Spark – Very large data, Fast & scalable  Query over Big Data? • Hive – Fundamental, SQL-like • Impala – More advanced alternative • Stinger - Making Hive itself better  ML supervised / unsupervised? • Mahout - Cloud based ML for big data • MLlib - Super large data sets, super fast  Mine over streaming big data? • Only streams – Storm • Batch & Streams – Flink 63
  146. 146.  Big Data Skills • Cloud Technology • Deep Learning • Business Perspectives • Scientific Domains  Salary $100k + • Big Data Programmers • Big Data Analysts  Univ Programs & concentrations • Data Analytics • Data Science 64
  147. 147.  Big Data Concentration being developed in CS Dept • http://cs.montclair.edu/ • Data Mining, Remote Sensing, HCI, Parallel Computing, Bioinformatics, Software Engineering  Global Education Programs available for visiting and exchange students • http://www.montclair.edu/global- education/  Please contact me for details 65
  148. 148.  Include more cloud intelligence in big data analytics  Further bridge gap between Hive & SQL  Add features from standalone ML packages to Mahout, MLlib …  Extend big data capabilities to focus more on PB & higher scales  Enhance mining of big data streams w/ cloud services & deep learning  Address security & privacy issues on a deeper level for cloud & big data  Build lucrative applications w/ scalable technologies for big data  Conduct domain-specific research, e.g., Cloud & GIS, Cloud & Green Computing 66
  149. 149. [1] M. Bansal, D. Klein. Coreference Semantics from Web Features. In Proceedings of ACL 2012. [2] R. Bekkerman, M. Bilenko, J. Langford (Eds.). Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, 2011. [3] T. Brants, A. Franz.Web 1T 5-Gram Version 1, Linguistic Data Consortium, 2006. [4] R. Collobert, J.Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa. Natural Language Processing (Almost) from Scratch, Journal of Machine Learning Research (JMLR), 2011. [5] G. de Melo. Exploiting Web Data Sources for Advanced NLP. In Proceedings of COLING 2012. [6] G. de Melo, K. Hose. Searching the Web of Data. In Proceedings of ECIR 2013. Springer LNCS. [7] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In USENIX OSDI-04, San Francisco, CA, pp. 137-149. [8] C. Engle, A. Lupher, R. Xin, M. Zaharia,M. Franklin, S. Shenker, I. Stoica. Shark: Fast Data Analysis using Coarse-Grained Distributed Memory. In Proceedings of SIGMOD 2013, ACM, pp. 689-692. [9] A. Ghoting, P. Kambadur, E. Pednault, R. Kannan. NIMBLE: A Toolkit for the Implementation of Parallel Data Mining and Machine Learning Algorithms on MapReduce. In Proceedings of KDD 2011. ACM, New York, NY, USA, pp. 334-342. [10] K. Hammond, A.Varde. Cloud-Based Predictive Analytics,. In ICDM-13 KDCloud Workshop, Dallas,Texas, December 2013. 67
  150. 150. [11] K. Hose, R. Schenkel, M.Theobald, G.Weikum: Database Foundations for Scalable RDF Processing. Reasoning Web 2011, pp. 202-249. [12] R. Kiros, R. Salakhutdinov, R. Zemel. Multimodal Neural Language Models. In Proc. of ICML, 2014. [13] T. Kraska, A.Talwalkar,J.Duchi, R. Griffith, M. Franklin, M. Jordan. MLbase: A Distributed Machine Learning System. In Conference on Innovative Data Systems Research, 2013. [14] Q.V. Le,T. Mikolov. Distributed Representations of Sentences and Documents. In Proceedings of ICML, 2014. [15] J. Leskovec, A. Rajaraman,J. Ullman. Mining of Massive Datasets. Cambridge University Press, 2011. [16] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean. Distributed Representations of Words and Phrases and Their Compositionality. In NIPS 26, pp. 3111–3119, 2013. [17] R. Nayak, P. Senellart, F. Suchanek, A.Varde. Discovering Interesting Information with Advances in Web Technology. SIGKDD Explorations 14(2): 63-81 (2012). [18] M. Riondato, J. DeBrabant, R. Fonseca, E. Upfal.PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce. In Proceedings of CIKM 2012. ACM, pp. 85-94. [19] F. Suchanek, A.Varde, R.Nayak, P.Senellart.The Hidden Web, XML and the Semantic Web: Scientific Data Management Perspectives. EDBT 2011, Uppsala, Sweden, pp. 534-537. 68
  151. 151. [20] N.Tandon, G. de Melo, G.Weikum. Deriving a Web-Scale Common Sense Fact Database.In Proceedings of AAAI 2011. [21] J.Tancer, A.Varde.The Deployment of MML For Data Analytics Over The Cloud. In ICDM-11 KDCloud Workshop, December 2011,Vancouver, Canada, pp. 188-195. [22] A.Thusoo, J. Sarma,N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P.Wyckoff, R. Murthy. Hive: A Petabyte Scale Data Warehouse Using Hadoop, 2009. [23] J.Turian, L. Ratinov,Y. Bengio.Word Representations: A simple and general method for semi-supervised learning. In Proceedings of ACL 2010. [24] A.Varde, F. Suchanek, R.Nayak, P.Senellart. Knowledge Discovery over the Deep Web, Semantic Web and XML. DASFAA 2009, Brisbane, Australia, pp. 784-788. [25] T.White. Hadoop.The Definitive Guide, O’Reilly, 2011. [26] http://flink.incubator.apache.org [27] https://github.com/twitter/scalding‎ [28] http://hortonworks.com/labs/stinger/ [29]http://linkeddata.org/ [30] http://mahout.apache.org/ [31] https://storm.apache.org 69
  152. 152. Contact Information Gerard de Melo (gdm@demelo.org) [http://gerard.demelo.org] Aparna Varde (vardea@montclair.edu) [http://www.montclair.edu/~vardea] 70
