This document discusses big data and intensive data processing. It defines big data and compares it to traditional analytics. It discusses technologies used for big data like Hadoop, MapReduce, and machine learning. It also discusses frameworks for analyzing big data like Apache Mahout and how Mahout is moving away from MapReduce to platforms like Apache Spark.
Big data: Knowledge discovery in big data and cloud computing environments - Nelson Favilla
1. Intensive Data Processing (Big Data)
Nelson F. F. Ebecken
NTT/COPPE/UFRJ
Your Big Data Is Worthless if You Don’t Bring It Into the Real World
http://www.wired.com/2014/04/your-big-data-is-worthless-if-you-dont-bring-it-into-the-real-world/
2. Big Data
Big Data refers to data that is too big to fit on a
single server, too unstructured to fit into a
row-and-column database, or too
continuously flowing to fit into a static data
warehouse (Thomas H. Davenport)
3. Big Data and traditional analytics
Aspect             Big Data                            Traditional analytics
Type of data       Unstructured formats                Formatted in rows and columns
Volume of data     100 terabytes to petabytes          Tens of terabytes or less
Flow of data       Constant flow of data               Static pool of data
Analysis methods   Machine learning                    Hypothesis-based
Primary purpose    Data-based products and services    Internal decision support
4. A menu of big data possibilities
Style of data      Source of data   Industry affected    Function affected
Large volume       Online           Financial services   Marketing
Unstructured       Video            Health care          Supply chain
Continuous flow    Sensor           Manufacturing        Human resources
Multiple formats   Genomic          Travel/transport     Finance
5. Terminology for using and analyzing data
Term                                  Time frame     Specific meaning
Decision support                      1970-1985      Use of data analysis to support decision making
Executive support                     1980-1990      Focus on data analysis for decisions by senior executives
Online analytical processing (OLAP)   1990-2000      Software for analyzing multidimensional data tables
Business intelligence                 1989-2005      Tools to support data-driven decisions, with emphasis on reporting
Analytics                             2005-2010      Focus on statistical and mathematical analysis for decisions
Big Data                              2010-present   Focus on very large, unstructured, fast-moving data
6. How important is Big Data to You and Your Organization?
Has your management team considered some of the new types of data that may affect your business and industry, both now and in the next several years?
Have you discussed the term big data and whether it's a good description of what your organization is doing with data and analytics?
Are you beginning to change your decision-making processes toward a more continuous approach driven by the continuous availability of data?
Has your organization adopted faster and more agile approaches to analyzing and acting on important data and analysis?
Are you beginning to focus more on external information about business and market environments?
Have you made a big bet on big data?
7. Big data is going to reshape a lot of different
businesses and industries
Every industry that moves things
Every industry that sells to consumers
Every industry that employs machinery
Every industry that sells or uses content
Every industry that provides service
Every industry that has physical facilities
Every industry that involves money
8. Locus of responsibility for big data projects
Objective                    Discovery                                   Production
Cost savings                 IT innovation group                         IT architecture and operations
Faster decisions             Business unit or function analytics group   Business unit or function executive
Better decisions             Business unit or function analytics group   Business unit or function executive
Product/service innovation   R&D or product development group            Product development or product management
9. Overview of technologies for big data
Technology                          Definition
Hadoop                              Open source software for processing big data across multiple parallel servers
MapReduce                           The architectural framework on which Hadoop is based
Scripting languages                 Programming languages that work well with big data (Python, Pig, Hive...)
Machine learning                    Algorithms for rapidly finding the model that best fits a data set
Visual analytics                    Display of analytical results in visual or graphic formats
Natural language processing (NLP)   Algorithms for analyzing text, frequencies, meanings,...
In-memory analytics                 Processing big data in computer memory for greater speed
10. MapReduce
MapReduce is a programming model for expressing
distributed computations on massive amounts of data and
an execution framework for large-scale data processing on
clusters of commodity servers.
It was originally developed by Google
– In 2003, Google published a paper describing its distributed file system, called GFS
– In 2004, Google published the paper that introduced MapReduce
MapReduce has since enjoyed widespread adoption via Hadoop, an open-source implementation (an Apache project) whose development was led by Yahoo.
11. Programming Model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
– Processes input key/value pair
– Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
– Produces a set of merged output values (usually just one)
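The two signatures above can be illustrated with a word count, the usual introductory example. This is a minimal single-process sketch, not Hadoop code: `run_mapreduce` is a hypothetical stand-in for the framework, and the input records are made up for illustration.

```python
from collections import defaultdict


def map_fn(in_key, in_value):
    """Map: emit an intermediate (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in in_value.split()]


def reduce_fn(out_key, intermediate_values):
    """Reduce: merge all counts for one word into a single output value."""
    return [sum(intermediate_values)]


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        for out_key, value in map_fn(in_key, in_value):
            groups[out_key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


# Illustrative input: (line number, line text) pairs.
records = [(0, "big data is big"), (1, "data flows")]
word_counts = run_mapreduce(records, map_fn, reduce_fn)
print(word_counts)
```

In a real cluster the grouping step is done by the framework across machines; only `map_fn` and `reduce_fn` are written by the programmer.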
12. Map-Reduce
Parallel programming for large masses of data
[Figure: three parallel tasks, each flowing input -> Map/Combine/Partition -> key/val pairs -> Shuffle -> Sort/Reduce -> output]
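The stages named on this slide (Map/Combine/Partition, Shuffle, Sort/Reduce) can be sketched in a single process as follows. Everything here is an illustrative assumption: the number of reducers, the CRC32-based partitioner, and the three tiny input splits are not taken from the slides or from Hadoop.

```python
from collections import Counter, defaultdict
from zlib import crc32

NUM_REDUCERS = 2  # hypothetical cluster setting


def map_and_combine(split):
    """Map every word to a count of 1, then combine locally per word."""
    return Counter(word for line in split for word in line.split())


def partition(key, num_reducers=NUM_REDUCERS):
    """Choose the reducer for a key with a deterministic hash."""
    return crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs):
    """Route every combined key/val pair to the bucket of its reducer."""
    buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for combined in mapper_outputs:
        for key, value in combined.items():
            buckets[partition(key)][key].append(value)
    return buckets


def sort_and_reduce(bucket):
    """Visit keys in sorted order and sum the partial counts for each key."""
    return [(key, sum(bucket[key])) for key in sorted(bucket)]


# Three input splits, one per mapper, as in the three-row diagram.
splits = [["a b a"], ["b c"], ["a c c"]]
mapper_outputs = [map_and_combine(split) for split in splits]
reducer_outputs = [sort_and_reduce(bucket) for bucket in shuffle(mapper_outputs)]
totals = dict(pair for output in reducer_outputs for pair in output)
print(totals)
```

The combine step shrinks each mapper's output before the shuffle, which is where network traffic would occur on a real cluster.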
13. Why learn models in MapReduce?
High data throughput
– Stream about 100 Tb per hour using 500 mappers
Framework provides fault tolerance
– Monitors mappers and reducers and restarts tasks on other machines should one of the machines fail
Excels in counting patterns over data records
Built on relatively cheap, commodity hardware
– No special-purpose computing hardware
Large volumes of data are increasingly being stored on Grid clusters running MapReduce
– Especially in the internet domain
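As a rough check on the throughput figure quoted on this slide, the per-mapper rate can be worked out directly. The sketch assumes that "Tb" means terabytes and that decimal units apply (1 TB = 1,000,000 MB); neither is stated in the slides.

```python
TOTAL_TB_PER_HOUR = 100   # figure quoted on the slide (assumed to be terabytes)
NUM_MAPPERS = 500         # figure quoted on the slide
MB_PER_TB = 1_000_000     # assumption: decimal units
SECONDS_PER_HOUR = 3600

# Share of the stream handled by each mapper.
tb_per_mapper_per_hour = TOTAL_TB_PER_HOUR / NUM_MAPPERS
mb_per_mapper_per_second = tb_per_mapper_per_hour * MB_PER_TB / SECONDS_PER_HOUR

print(f"{tb_per_mapper_per_hour:.1f} TB/hour per mapper")
print(f"{mb_per_mapper_per_second:.1f} MB/s per mapper")
```

Under these assumptions each mapper only needs to sustain a rate on the order of tens of megabytes per second, which is within reach of a single commodity disk.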
14. Why learn models in MapReduce?
• Learning can become limited by computation time and not data volume
– With large enough data and number of machines
– Reduces the need to down-sample data
– More accurate parameter estimates compared to learning on a single machine for the same amount of time
15. Learning models in MapReduce
A primer for learning models in MapReduce (MR)
– Illustrate techniques for distributing the learning algorithm in a MapReduce framework
– Focus on the mapper and reducer computations
Data-parallel algorithms are most appropriate for MapReduce implementations
Not necessarily the optimal implementation for a specific algorithm
– Other specialized non-MapReduce implementations exist for some algorithms, which may be better
MR may not be the appropriate framework for exact solutions of non-data-parallel/sequential algorithms
– Approximate solutions using MapReduce may be good enough
16. Types of learning in MapReduce
• Three common types of learning models using the MapReduce framework
1. Parallel training of multiple models
– Train either in mappers or reducers
2. Ensemble training methods
– Train multiple models and combine them
3. Distributed learning algorithms
– Learn using both mappers and reducers
Use the Grid as a large cluster of independent machines (with fault tolerance)
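The third type, a distributed learning algorithm that uses both mappers and reducers, can be sketched with simple linear regression. This example is an assumption chosen for illustration and does not come from the slides: each mapper reduces its shard to five additive sums, and the reducer adds them up and solves the least-squares fit in closed form. The function names and the data are hypothetical.

```python
def map_partial_sums(shard):
    """Mapper: reduce one shard of (x, y) points to five additive statistics."""
    n = len(shard)
    sx = sum(x for x, _ in shard)
    sy = sum(y for _, y in shard)
    sxx = sum(x * x for x, _ in shard)
    sxy = sum(x * y for x, y in shard)
    return ("stats", (n, sx, sy, sxx, sxy))


def reduce_fit(key, partials):
    """Reducer: add the partial statistics and solve for slope and intercept."""
    n, sx, sy, sxx, sxy = (sum(column) for column in zip(*partials))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Two illustrative shards whose points all lie on y = 2x + 1.
shards = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
partials = [map_partial_sums(shard)[1] for shard in shards]
slope, intercept = reduce_fit("stats", partials)
print(slope, intercept)
```

Because the statistics are plain sums, the result matches a single-machine fit over all the data, however the points are split across mappers.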
17. Parallel training of multiple models
Train multiple models simultaneously using a learning algorithm that can run in memory
Useful when individual models are trained on a subset, a filtered version, or a modification of the raw data
Can train thousands of models simultaneously
Essentially, treat the Grid as a large cluster of machines
– Leverage fault tolerance of Hadoop
Train 1 model in each reducer
– Map:
Input: All data
Filters subset of data relevant for each model training
Output: <model_index, subset of data for training this model>
– Reduce
Train model on data corresponding to that model_index
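The map and reduce steps on this slide can be sketched as follows. The filters, the record fields, and the choice of "model" (a simple mean) are hypothetical placeholders; any learner that fits in a reducer's memory could take its place, and the grouping loop stands in for the framework's shuffle.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical filters: each model trains on the records its predicate accepts.
MODEL_FILTERS = {
    0: lambda record: record["region"] == "north",
    1: lambda record: record["region"] == "south",
    2: lambda record: record["amount"] >= 100,
}


def map_fn(record):
    """Map: emit <model_index, record> for every model that uses this record."""
    return [(idx, record) for idx, keep in MODEL_FILTERS.items() if keep(record)]


def reduce_fn(model_index, records):
    """Reduce: train one in-memory model; here the 'model' is just a mean."""
    return {
        "model_index": model_index,
        "mean_amount": mean(r["amount"] for r in records),
    }


# Illustrative input data.
data = [
    {"region": "north", "amount": 50},
    {"region": "north", "amount": 150},
    {"region": "south", "amount": 200},
]

grouped = defaultdict(list)  # stands in for the framework's shuffle
for record in data:
    for model_index, value in map_fn(record):
        grouped[model_index].append(value)

models = {idx: reduce_fn(idx, records) for idx, records in grouped.items()}
print(models)
```

A record may be sent to several models, which is why the mapper emits one pair per matching filter rather than a single pair per record.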
18. Apache Mahout
Scalable to large data sets. Our core algorithms for clustering, classification and
collaborative filtering are implemented on top of scalable, distributed systems.
However, contributions that run on a single machine are welcome as well.
Scalable to support your business case. Mahout is distributed under a
commercially friendly Apache Software license.
Scalable community. The goal of Mahout is to build a vibrant, responsive, diverse
community to facilitate discussions not only on the project itself but also on potential
use cases. Come to the mailing lists to find out more.
Currently Mahout supports mainly three use cases: Recommendation mining takes
users' behavior and from that tries to find items users might like. Clustering takes
e.g. text documents and groups them into groups of topically related documents.
Classification learns from existing categorized documents what documents of a
specific category look like and is able to assign unlabelled documents to the
(hopefully) correct category.
25 April 2014 - Goodbye MapReduce
The Mahout community decided to move its codebase onto modern data processing systems that offer a richer
programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new
MapReduce algorithm implementations from now on. We will however keep our widely used MapReduce
algorithms in the codebase and maintain them.
We are building our future implementations on top of a DSL for linear algebraic operations which has been
developed over the last months. Programs written in this DSL are automatically optimized and executed in
parallel on Apache Spark.
Furthermore, an experimental contribution is under way that aims to integrate the H2O platform into Mahout.
Apache Spark™ is a fast and general engine for large-scale data processing.
H2O is the open-source in-memory solution from 0xdata for predictive analytics on big data.
19. Matrix Methods with Hadoop
Slides: bit.ly/10SIe1A
Code: github.com/dgleich/matrix-hadoop-tutorial
David F. Gleich, Assistant Professor, Computer Science, Purdue University
21. ACM KDD 2014
24-27 August 2014
New environments: Microsoft Azure ML Studio, Google Prediction API, ...
2 Research Sessions + Industry & Government
Statistical Techniques for Big Data
Scaling-up Methods for Big Data
Topic Modeling
22. Big data & machine learning
This is a huge field, growing very fast
Many algorithms and techniques:
can be seen as a giant toolbox with wide-ranging applications
Ranging from the very simple to the extremely sophisticated
Difficult to see the big picture
Huge range of applications
Math skills are crucial