This deck covers some of the open problems in the big data analytics space, starting with a discussion of state-of-the-art analytics using Spark and Hadoop YARN. It discusses whether each of these technologies is appropriate and explores alternatives wherever possible. It ends with an important open problem: how to build a single system that handles big data pipelines without explicit data transfers.
1. Open Problems in Big Data Analytics: A Practitioner's View
Dr. Vijay Srinivas Agneeswaran,
Director and Head, Big-data R&D,
Innovation Labs, Impetus
Invited Talk, National Conference on Distributed
Machine Learning, Feb 2015
3. State of the Art in Big Data Analytics
• Start from business questions: how quickly and accurately can we get answers?
• Data gets stored in HDFS
• Various frameworks process the data:
• Spark – machine learning
• Giraph/GraphLab – graph processing
• Storm – real-time processing
4. State of the Art in Big Data Analytics
• Is HDFS the right storage?
• Alternatives: Cassandra, MapR M7, QFS, Cleversafe, Isilon, etc.
http://www.inktank.com/news-events/new/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
5. State of the Art in Big Data Analytics
• Is Spark the right platform for processing?
• Alternatives
• Flink
• Forge – a meta domain-specific language
6. State of the Art in Big Data Analytics
• Is Spark Streaming/Storm the right platform for stream processing?
7. Big Data Computations/Operations
• Giant 1 (simple statistics) is a perfect fit for Hadoop 1.0 (the "giants" taxonomy is from [1]).
• Giants 2 (linear algebra), 3 (N-body), and 4 (optimization): is Spark from UC Berkeley efficient enough? Workloads include logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, and alternating least squares. An example is the social-group-first approach to consumer churn analysis [2].
• Interactive/on-the-fly data processing – Storm.
• OLAP – data cube operations: Dremel/Drill.
• What about data sets that are not embarrassingly parallel?
• Deep learning – artificial neural networks/deep belief networks: machine vision from Google [3], speech analysis from Microsoft.
• Giant 5 – graph processing: GraphLab, Pregel, Giraph.
[1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.
[2] Richter, Yossi; Yom-Tov, Elad; Slonim, Noam: Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of the SIAM International Conference on Data Mining, 2010, pp. 732-741.
[3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, Andrew Y. Ng: Large Scale Distributed Deep Networks. NIPS 2012.
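The Giant-4 optimization workloads named above (logistic regression, conjugate gradient descent, and the rest) all reduce to iterating an update over the same data, which is why caching platforms like Spark suit them better than Hadoop 1.0. As a minimal single-machine sketch, here is logistic regression trained by batch gradient descent on made-up toy data (the data and hyperparameters are illustrative assumptions, not from the talk):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights by batch gradient descent on the mean logistic loss."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Gradient of the mean log-loss; in Spark this map-reduce over the
        # data would run distributed, with X cached in cluster memory.
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w

# Toy data: the label is 1 exactly when the first feature is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)

w = train_logistic_regression(X, y)
preds = (sigmoid(X @ w) > 0.5).astype(float)
accuracy = (preds == y).mean()
```

The inner loop touches the full data set every epoch; on Hadoop 1.0 each epoch would re-read it from HDFS, while Spark keeps it cached, which is the efficiency question the slide raises.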
8. Big Data Pipelines
1. Nuance – incompleteness
2. Scale
3. Timeliness
4. Privacy
5. Human in the loop
9. Big Data Pipelines: Data Acquisition
• Needle in a haystack.
• BlinkDB?
• Automatic metadata discovery
10. Big Data Pipelines: Information Extraction
• Error models for data cleaning
• Multimedia data
12. The network to identify the individual digits from the input image
http://neuralnetworksanddeeplearning.com/chap1.html
Copyright @Impetus Technologies, 2014
13. DLNs for Face Recognition
15. Success stories of DLNs
• Android voice recognition system – based on DLNs; improves accuracy by 25% compared to the state of the art
• Microsoft Skype Translate software and the digital assistant Cortana
• 1.2 million images, 1000 classes (ImageNet data) – error rate of 15.3%, better than the state of the art at 26.1%
16. Success stories of DLNs (contd.)
• Senna system – PoS tagging, chunking, NER, semantic role labeling, syntactic parsing; comparable F1 score to the state of the art, with a huge speed advantage (5 days vs. a few hours).
• DLNs vs. TF-IDF for relevance search over 1 million documents: 3.2 ms vs. 1.2 s.
• Robot navigation
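The TF-IDF baseline the comparison above refers to can be sketched in a few lines of plain Python: vectorize documents by term frequency weighted by inverse document frequency, then rank by cosine similarity. The corpus, query, and whitespace tokenization here are invented toy examples, not the 1-million-document benchmark from the slide:

```python
import math
from collections import Counter

def build_idf(docs):
    """Inverse document frequency over a tokenized corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log(n / df[t]) for t in df}

def vectorize(doc, idf):
    """Sparse TF-IDF vector (dict) for one tokenized document."""
    tf = Counter(doc)
    return {t: (tf[t] / len(doc)) * idf.get(t, 0.0) for t in tf}

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "spark runs machine learning at scale".split(),
    "storm handles real time stream processing".split(),
    "deep networks learn features for machine vision".split(),
]
idf = build_idf(docs)
vecs = [vectorize(d, idf) for d in docs]

query_vec = vectorize("machine learning".split(), idf)
best = max(range(len(docs)), key=lambda i: cosine(query_vec, vecs[i]))
```

The DLN approach the slide favors replaces these sparse bag-of-words vectors with learned dense representations, which is where the reported speed and relevance gains come from.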
18. Conclusions
• Hadoop = HDFS + MapReduce
• Useful for large-scale, embarrassingly parallel processing of data sets
• Not so good for iterative, interactive computing
• Beyond the Hadoop MapReduce philosophy:
• Optimization and other problems
• Real-time computation
• Processing specialized data structures
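The embarrassingly parallel style the conclusions refer to is the classic MapReduce word count: independent map calls, a shuffle that groups by key, and a reduce per key. A toy in-memory sketch (real Hadoop distributes the map and reduce phases over HDFS blocks; the input lines here are made up):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(w.lower(), 1) for w in line.split()]

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {k: sum(vs) for k, vs in groups.items()}

lines = ["Hadoop stores data in HDFS", "Spark caches data in memory"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(l) for l in lines)))
```

Each map call sees only its own line and each reduce only its own key, which is exactly why this pattern scales trivially, and why iterative or graph-structured workloads, which need state across passes, fit it poorly.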
20. References
• Divyakant Agarwal et al., Challenges and Opportunities with Big Data, Computing Research Association White Paper, available from http://www.cra.org/ccc/files/docs/init/bigdatawhitepaper.pdf
• Vijay Srinivas Agneeswaran et al., Distributed Deep Learning over Spark, available at http://www.datasciencecentral.com/profiles/blogs/implementing-a-distributed-deep-
Editor's Notes
Reference : http://neuralnetworksanddeeplearning.com/chap1.html
Consider the problem of identifying the individual digits from an input image.
Each image is a 28 by 28 pixel image. The network is then designed as follows:
Input layer (image) -> 28*28 = 784 neurons; each neuron corresponds to one pixel.
The output layer is sized by the number of digits to be identified, i.e. 10 (0 to 9).
The intermediate hidden layer can be experimented with using varying numbers of neurons; let us fix it at 10 nodes.
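The 784-10-10 network described in this note can be sketched as a forward pass in NumPy. The weights here are random and untrained, so the predicted digit is arbitrary; the sketch only shows the layer shapes and the sigmoid activations from the referenced chapter:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
# Layer sizes from the note: 784 input pixels, 10 hidden neurons, 10 output digits.
W1 = rng.normal(scale=0.1, size=(784, 10))
b1 = np.zeros(10)
W2 = rng.normal(scale=0.1, size=(10, 10))
b2 = np.zeros(10)

def forward(image):
    """Forward pass: pixels -> hidden layer -> one score per digit 0-9."""
    hidden = sigmoid(image @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)

image = rng.random(784)          # stand-in for a flattened 28x28 grayscale image
scores = forward(image)
digit = int(np.argmax(scores))   # predicted digit (untrained, so arbitrary)
```

Training would adjust W1, b1, W2, b2 by backpropagation against labeled digits; that part is beyond the scope of this sketch.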
Reference: http://neuralnetworksanddeeplearning.com/chap1.html
How about recognizing a human face in a given set of random images?
Attack this problem in a fashion similar to the one explained earlier: input -> image pixels; output -> is it a face or not? (a single node).
A face can be recognized by answering questions like "Is there an eye in the top left?" or "Is there a nose in the middle?", etc.
Each question corresponds to a hidden layer.