This deck covers some of the open problems in the big data analytics space, starting with a discussion of state-of-the-art analytics using Spark and Hadoop YARN. It discusses whether each of these technologies is appropriate and explores alternatives wherever possible. It ends with an important open problem: how to build a single system that handles big data pipelines without explicit data transfers.
1. Open Problems in Big Data Analytics: A Practitioner's View
Dr. Vijay Srinivas Agneeswaran,
Director and Head, Big-data R&D,
Innovation Labs, Impetus
Invited Talk, National Conference on Distributed
Machine Learning, Feb 2015
3. State of the Art in Big Data Analytics
• Start from business questions: how quickly and accurately can we get answers?
• Data gets stored in HDFS
• Various frameworks process the data:
• Spark – machine learning
• Giraph/GraphLab – graph processing
• Storm – real-time processing
4. State of the Art in Big Data Analytics
• Is HDFS the right storage?
• Alternatives: Cassandra, MapR M7, QFS, Cleversafe, Isilon, etc.
http://www.inktank.com/news-events/new/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
5. State of the Art in Big Data Analytics
• Is Spark the right platform for processing?
• Alternatives
• Flink
• Forge – a meta domain-specific language
6. State of the Art in Big Data Analytics
• Is Spark Streaming/Storm the right platform for stream processing?
7. Big Data Computations/Operations
• Giant 1 (simple statistics) is a perfect fit for Hadoop 1.0 (the "giants" taxonomy is from [1]).
• Giants 2 (linear algebra), 3 (N-body), and 4 (optimization): is Spark from UC Berkeley efficient enough? Workloads include logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, and alternating least squares. An example is the social-group-first approach to consumer churn analysis [2].
• Interactive/on-the-fly data processing – Storm.
• OLAP – data cube operations: Dremel/Drill.
• What about data sets that are not embarrassingly parallel?
• Deep learning – artificial neural networks/deep belief networks: machine vision from Google [3], speech analysis from Microsoft.
• Giant 5 – graph processing: GraphLab, Pregel, Giraph.
[1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.
[2] Richter, Yossi; Yom-Tov, Elad; Slonim, Noam: Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of the SIAM International Conference on Data Mining, 2010, pp. 732-741.
[3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, Andrew Y. Ng: Large Scale Distributed Deep Networks. NIPS 2012.
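The Giant-4 optimization workloads named above (logistic regression, conjugate gradient descent, and the rest) all reduce to iterating an update over the same data, which is why caching platforms like Spark suit them better than Hadoop 1.0. As a minimal single-machine sketch, here is logistic regression trained by batch gradient descent on made-up toy data (the data and hyperparameters are illustrative assumptions, not from the talk):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights by batch gradient descent on the mean logistic loss."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Gradient of the mean log-loss; in Spark this map-reduce over the
        # data would run distributed, with X cached in cluster memory.
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w

# Toy data: the label is 1 exactly when the first feature is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)

w = train_logistic_regression(X, y)
preds = (sigmoid(X @ w) > 0.5).astype(float)
accuracy = (preds == y).mean()
```

The inner loop touches the full data set every epoch; on Hadoop 1.0 each epoch would re-read it from HDFS, while Spark keeps it cached, which is the efficiency question the slide raises.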
8. Big Data Pipelines
1. Nuance – incompleteness
2. Scale
3. Timeliness
4. Privacy
5. Human in the loop
9. Big Data Pipelines: Data Acquisition
• Needle in a haystack.
• BlinkDB?
• Automatic metadata discovery
10. Big Data Pipelines: Information Extraction
• Error models for data cleaning
• Multimedia data
12. The network to identify the individual digits from the input image
http://neuralnetworksanddeeplearning.com/chap1.html
Copyright @Impetus Technologies, 2014
13. DLNs for Face Recognition
15. Success stories of DLNs
• Android voice recognition system – based on DLNs; improves accuracy by 25% compared to the state of the art
• Microsoft Skype Translate software and the digital assistant Cortana
• 1.2 million images, 1000 classes (ImageNet data) – error rate of 15.3%, better than the state of the art at 26.1%
16. Success stories of DLNs (contd.)
• Senna system – PoS tagging, chunking, NER, semantic role labeling, syntactic parsing; comparable F1 score to the state of the art, with a huge speed advantage (5 days vs. a few hours).
• DLNs vs. TF-IDF for relevance search over 1 million documents: 3.2 ms vs. 1.2 s.
• Robot navigation
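The TF-IDF baseline the comparison above refers to can be sketched in a few lines of plain Python: vectorize documents by term frequency weighted by inverse document frequency, then rank by cosine similarity. The corpus, query, and whitespace tokenization here are invented toy examples, not the 1-million-document benchmark from the slide:

```python
import math
from collections import Counter

def build_idf(docs):
    """Inverse document frequency over a tokenized corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log(n / df[t]) for t in df}

def vectorize(doc, idf):
    """Sparse TF-IDF vector (dict) for one tokenized document."""
    tf = Counter(doc)
    return {t: (tf[t] / len(doc)) * idf.get(t, 0.0) for t in tf}

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "spark runs machine learning at scale".split(),
    "storm handles real time stream processing".split(),
    "deep networks learn features for machine vision".split(),
]
idf = build_idf(docs)
vecs = [vectorize(d, idf) for d in docs]

query_vec = vectorize("machine learning".split(), idf)
best = max(range(len(docs)), key=lambda i: cosine(query_vec, vecs[i]))
```

The DLN approach the slide favors replaces these sparse bag-of-words vectors with learned dense representations, which is where the reported speed and relevance gains come from.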
18. Conclusions
• Hadoop = HDFS + MapReduce
• Useful for large-scale, embarrassingly parallel processing of data sets
• Not so good for iterative, interactive computing
• Beyond the Hadoop MapReduce philosophy:
• Optimization and other problems
• Real-time computation
• Processing specialized data structures
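The embarrassingly parallel style the conclusions refer to is the classic MapReduce word count: independent map calls, a shuffle that groups by key, and a reduce per key. A toy in-memory sketch (real Hadoop distributes the map and reduce phases over HDFS blocks; the input lines here are made up):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(w.lower(), 1) for w in line.split()]

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {k: sum(vs) for k, vs in groups.items()}

lines = ["Hadoop stores data in HDFS", "Spark caches data in memory"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(l) for l in lines)))
```

Each map call sees only its own line and each reduce only its own key, which is exactly why this pattern scales trivially, and why iterative or graph-structured workloads, which need state across passes, fit it poorly.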
20. References
• Divyakant Agarwal et al., Challenges and Opportunities with Big Data, Computing Research Association White Paper, available from http://www.cra.org/ccc/files/docs/init/bigdatawhitepaper.pdf
• Vijay Srinivas Agneeswaran et al., Distributed Deep Learning over Spark, available at http://www.datasciencecentral.com/profiles/blogs/implementing-a-distributed-deep-
Editor's Notes
Reference : http://neuralnetworksanddeeplearning.com/chap1.html
Consider the problem of identifying the individual digits from an input image.
Each image is a 28 by 28 pixel image. The network is then designed as follows:
Input layer (image) -> 28*28 = 784 neurons; each neuron corresponds to one pixel.
The output layer is sized by the number of digits to be identified, i.e. 10 (0 to 9).
The intermediate hidden layer can be experimented with using varying numbers of neurons; let us fix it at 10 nodes.
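The 784-10-10 network described in this note can be sketched as a forward pass in NumPy. The weights here are random and untrained, so the predicted digit is arbitrary; the sketch only shows the layer shapes and the sigmoid activations from the referenced chapter:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
# Layer sizes from the note: 784 input pixels, 10 hidden neurons, 10 output digits.
W1 = rng.normal(scale=0.1, size=(784, 10))
b1 = np.zeros(10)
W2 = rng.normal(scale=0.1, size=(10, 10))
b2 = np.zeros(10)

def forward(image):
    """Forward pass: pixels -> hidden layer -> one score per digit 0-9."""
    hidden = sigmoid(image @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)

image = rng.random(784)          # stand-in for a flattened 28x28 grayscale image
scores = forward(image)
digit = int(np.argmax(scores))   # predicted digit (untrained, so arbitrary)
```

Training would adjust W1, b1, W2, b2 by backpropagation against labeled digits; that part is beyond the scope of this sketch.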
Reference: http://neuralnetworksanddeeplearning.com/chap1.html
How about recognizing a human face in a given set of random images?
Attack this problem in a fashion similar to the one explained earlier: input -> image pixels; output -> is it a face or not? (a single node).
A face can be recognized by answering questions like "Is there an eye in the top left?" or "Is there a nose in the middle?", etc.
Each question corresponds to a hidden layer.