SlideShare a Scribd company logo
1 of 47
Outline
• Big Data
• Big Data
Processing
– To Know
– To explain
– To forecast
• Conclusion
Photo by John Trainoron Flickr
http://www.flickr.com/photos/trainor/2902023575/, Licensed under CC
A day in your life
 Think about a day in your life?
– What is the best road to take?
– Would there be any bad weather?
– How to invest my money?
– How is my health?
 There are many decisions that
you can do better if only you can
access the data and process
them.
http://www.flickr.com/photos/kcolwell/5
512461652/ CC licence
Data Avalanche/ Moore’s law of data
• We are now collecting and converting large amount
of data to digital forms
• 90% of the data in the world today was created
within the past two years.
• Amount of data we have doubles very fast
Internet of Things
• Currently physical world
and software worlds are
detached
• Internet of things
promises to bridge this
– It is about sensors and
actuators everywhere
– In your fridge, in your
blanket, in your chair, in
your carpet.. Yes even in
your socks
– Google IO pressure mats
What can we do with Big Data?
• Optimize
– 1% saving in Airplanes and
turbines can save more than
1B$ each year (GE talk, Statra
2014). SL GDP 60B
• Save lives
– Weather, Disease identification,
Personalized treatment
• Technology advancement/
Understanding
– Most high tech work are done
via simulations
Big Data Architecture
Why Big Data is hard?
• How store? Assuming 1TB bytes it
takes 1000 computers to store a 1PB
• How to move? Assuming 10Gb
network, it takes 2 hours to copy 1TB,
or 83 days to copy a 1PB
• How to search? Assuming each record
is 1KB and one machine can process
1000 records per sec, it needs 277CPU
days to process a 1TB and 785 CPU
years to process a 1 PB
• How to process?
– How to convert algorithms to work in large
size
– How to create new algorithms
http://www.susanica.com/photo/9
Why it is hard (Contd.)?
• System build of many
computers
• That handles lots of data
• Running complex logic
• This pushes us to frontier of
Distributed Systems and
Databases
• More data does not mean
there is a simple model
• Some models can be complex
as the system
http://www.flickr.com/photos/mariachily/5250487136,
Licensed CC
Tools for Processing
Data
Big data Processing Technologies
Landscape
MapReduce/ Hadoop
• First introduced by Google,
and used as the processing
model for their architecture
• Implemented by opensource
projects like Apache Hadoop
and Spark
• Users writes two functions:
map and reduce
• The framework handles the
details like distributed
processing, fault tolerance,
load balancing etc.
• Widely used, and the one of
the catalyst of Big data
void map(ctx, k, v){
tokens = v.split();
for t in tokens
ctx.emit(t,1)
}
void reduce(ctx, k, values[]){
count = 0;
for v in values
count = count + v;
ctx.emit(k,count);
}
MapReduce (Contd.)
Histogram of Speeds
Histogram of Speeds(Contd.)
void map(ctx, k, v){
event = parse(v);
range = calcuateRange(event.v);
ctx.emit(range,1)
}
void reduce(ctx, k, values[]){
count = 0;
for v in values
count = count + v;
ctx.emit(k,count);
}
Hadoop Landscape
• HIVE - Query data using SQL style queries, and Hive will
convert them to MapReduce jobs and run in Hadoop.
• Pig - We write programs using data flow style scripts, and
Pig convert them to MapReduce jobs and run in Hadoop.
• Mahout
– Collection of MapReduce jobs implementing many Data
mining and Artificial Intelligence algorithms using MapReduce
A = load 'hdi-data.csv' using PigStorage(',')
AS (id:int, country:chararray, hdi:float,
lifeex:int, mysch:int, eysch:int, gni:int);
B = FILTER A BY gni > 2000;
C = ORDER B BY gni;
dump C;
hive> SELECT country, gni from HDI WHERE gni > 2000;
Is Hadoop Enough?
• Limitations
– Takes time for processing
– Lack of Incremental processing
– Weak with Graph usecases
– Not very easy to create a processing pipeline (addressed
with HIVE, Pig etc.)
– Too close to programmers
– Faster implementations are possible
• Alternatives
– Apache Drill http://incubator.apache.org/drill/
– Spark http://spark-project.org/
– Graph Processors like http://giraph.apache.org/
Spark
• New Programming
Model built on
functional
programming concepts
• Can be much faster for
recursive usecases
• Have a complete stack
of products
file =
spark.textFile("hdfs://...”)
file.flatMap(
line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
Real-time Analytics
• Idea is to process data as they are
received in streaming fashion
• Used when we need
– Very fast output
– Lots of events (few 100k to millions)
– Processing without storing (e.g. too
much data)
• Two main technologies
– Stream Processing (e.g. Strom,
http://storm-project.net/ )
– Complex Event Processing (CEP)
http://wso2.com/products/complex-
event-processor/
Complex Event Processing (CEP)
• Sees inputs as Event streams and queried with
SQL like language
• Supports Filters, Windows, Join, Patterns and
Sequences
from p=PINChangeEvents#win.time(3600) join
t=TransactionEvents[p.custid=custid][amount>1000
0] #win.time(3600)
return t.custid, t.amount;
DEBS Challenge
• Event Processing
challenge
• Real football game,
sensors in player
shoes + ball
• Events in 15k Hz
• Event format
– Sensor ID, TS, x, y, z, v,
a
• Queries
– Running Stats
– Ball Possession
– Heat Map of Activity
– Shots at Goal
Example: Detect ball Possession
• Possession is time a
player hit the ball
until someone else
hits it or it goes out
of the ground
from Ball#window.length(1) as b join
Players#window.length(1) as p
unidirectional
on debs: getDistance(b.x,b.y,b.z,
p.x, p.y, p.z) < 1000
and b.a > 55
select ...
insert into hitStream
from old = hitStream ,
b = hitStream [old. pid != pid ],
n= hitStream[b.pid == pid]*,
( e1 = hitStream[b.pid != pid ]
or e2= ballLeavingHitStream)
select ...
insert into BallPossessionStream
http://www.flickr.com/photos/glennharper/146164820/
Information Visualization
• Presenting information
– To end user
– To decision takers
– To scientist
• Interactive exploration
• Sending alerts
http://www.flickr.com/photos/stevefae
embra/3604686097/
Napoleon's Russian Campaign
Show Han Rosling’s Data
• This use a tool called “Gapminder”
GNU Plot
set terminal png
set output "buyfreq.png"
set title "Frequency Distribution of Items
brought by Buyer";
setylabel "Number of Items Brought";
setxlabel "Buyers Sorted by Items count";
set key left top
set log y
set log x
plot "1.data" using 2 title "Frequency"
with linespoints
• Open source
• Very powerful
Web Visualization Libraries
• D3, Flot etc.
• Could
generate
compelling
graphs
Solving the Problem
Making Sense of Data
• To know (what happened?)
– Basic analytics +
visualizations (min, max,
average, histogram,
distributions … )
– Interactive drill down
• To explain (why)
– Data mining, classifications,
building models, clustering
• To forecast
– Neural networks, decision
models
To know (what happened/
happening?)
• Analytics Implemented
with MapReduce or
Queries
– Min, Max, average,
correlation, histograms
– Might join group data in
many ways
• Data is often presented
with some visualizations
• Support for drill down
• Real-time analytics to
track things fast enough
and Alerts.
http://www.flickr.com/photos/isriya/2967310
333/
Analytics/ Summary
• Min, Max, Mean, Standard deviation,
correlation, grouping etc.
• E.g. Internet Access logs
– Number of hits, total, by day or hour, trend over
time
– File size distribution
– Request Success to failure ratio
– Hits by client IP region
http://pixabay.com, by Steven Iodice
Drill Down
• Idea is to let users drill down and explore a
view of the data (called BI or OLAP)
• Users go and define 3 dimensions (or more),
and tool let users explore the cube and only
look at subset of data.
• E.g. tool: Mondrian
http://pixabay.com, by Steven Iodice
Search
• Process and Index the data. The killer app of
our time.
http://pixabay.com, by Steven Iodice
• Web Search
• Graph Search
• Semantic Search
• Drug Discovery
• …
Search: New Ideas
• Apache Drill, Big Query
• Blink: search via sampling
To Explain (Patterns)
• Understanding underline
patterns from data
• Correlation
– Scatter plot, Regression
• Data Mining (Detecting
Patterns)
– Clustering
– Dimension Reduction
– Finding Similar items
– Finding Hubs and authorities in
a Graph
– Finding frequent item sets
– Making recommendation
http://www.flickr.com/photos/eriwst/2987739376/ and http://www.flickr.com/photos/focx/5035444779/
Usecase 1: Clustering
• Cluster by location
• Cluster by similarity
• Cluster to identify
clicks (e.g. fast
information
distribution,
forensic)
• E.g. Earth Quake
Data
Usecase 2: Regression
• Find the relationship
between variables
• E.g. find the trend
• E.g. relationship
between money
spent and Education
level improvement
Tools
• Weka – java machine learning library
• Mahout – MapReduce implementation of
Machine learning algorithms
• R – programming language for statistical
computing
• PMML (Predictive model markup language)
• Scaling these libraries is a challenge
Forecasts and Models
• Trying to build a model for the data
• Theoretically or empirically
– Analytical models (e.g. Physics, but
PDEs kill you)
– Simulations (Numerical methods)
– Regression Models
– Time Series Models
– Neural network models
– Reinforcement learning
– Classifiers
– dimensionality reduction, kernel
methods)
• Examples
– Translation
– Weather Forecast models
– Building profiles of users
– Traffic models
– Economic models
http://misterbijou.blogspot.com/201
0_09_01_archive.html
Usecase 1: Modeling Solar System
• PDEs not solvable
• Simulations
• Other examples: Weather
Usecase 2: Electricity Demand Forecast
• Time Series
• Moving averages
• Cycles
Usecase 3: Targeted Marketing
• Collect data and build a model on
– What user like
– What he has brought
– What is his buying power
• Making recommendations
• Giving personalized deals
Big data lifecycle
• Realizing the big data lifecycle is hard
• Need wide understanding about many fields
• Big data teams will include members from
many fields working together
• Data Scientists: A new role, evolving Job
description
• His Job is to make sense of data, by using Big data
processing, advance algorithms, Statistical
methods, and data mining etc.
• Likely to be a highest paid/ cool Jobs.
• Big organizations (Google, Facebook, Banks) hiring
Math PhD by a lot for this.
Data Scientist
Statisticians
System experts
Domain Experts
DB experts
AI experts
Challenges
• Fast scalable real-time analytics
• Merging real-time and batch processing
• Good libraries for machine learning
• Scaling decision models e.g. machine learning
Conclusions
• Big data? How, Why and Bigdata architecture
• Data Processing Tools
– MapReduce, CEP, visualizations
• Solving problems
– To know what happened
– To explain what happened
– To predict what will happen
• Conclusions
Questions?

More Related Content

What's hot

Streaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologiesStreaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologiesNatalino Busa
 
Introduction to WSO2 Data Analytics Platform
Introduction to  WSO2 Data Analytics PlatformIntroduction to  WSO2 Data Analytics Platform
Introduction to WSO2 Data Analytics PlatformSrinath Perera
 
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...Big Data Spain
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Sparkelephantscale
 
Deep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoDeep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoSri Ambati
 
Big data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at KitwareBig data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at Kitwarebigdataviz_bay
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingPaco Nathan
 
Big Data Visualization
Big Data VisualizationBig Data Visualization
Big Data Visualizationbigdataviz_bay
 
[161] 데이터사이언스팀 빌딩
[161] 데이터사이언스팀 빌딩[161] 데이터사이언스팀 빌딩
[161] 데이터사이언스팀 빌딩NAVER D2
 
Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013MLconf
 
Reference architecture for Internet Of Things
Reference architecture for Internet Of ThingsReference architecture for Internet Of Things
Reference architecture for Internet Of Thingselephantscale
 
Beyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at ScaleBeyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at ScaleTuri, Inc.
 
Follow the money with graphs
Follow the money with graphsFollow the money with graphs
Follow the money with graphsStanka Dalekova
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedTuri, Inc.
 
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Data Con LA
 
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLParikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLMLconf
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesKrishna Sankar
 
Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future TensePaco Nathan
 
Visualising and Linking Open Data from Multiple Sources
Visualising and Linking Open Data from Multiple SourcesVisualising and Linking Open Data from Multiple Sources
Visualising and Linking Open Data from Multiple SourcesData Driven Innovation
 

What's hot (20)

Streaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologiesStreaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologies
 
Introduction to WSO2 Data Analytics Platform
Introduction to  WSO2 Data Analytics PlatformIntroduction to  WSO2 Data Analytics Platform
Introduction to WSO2 Data Analytics Platform
 
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
 
Deep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoDeep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry Larko
 
Big data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at KitwareBig data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at Kitware
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 
Big Data Visualization
Big Data VisualizationBig Data Visualization
Big Data Visualization
 
[161] 데이터사이언스팀 빌딩
[161] 데이터사이언스팀 빌딩[161] 데이터사이언스팀 빌딩
[161] 데이터사이언스팀 빌딩
 
Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013
 
Reference architecture for Internet Of Things
Reference architecture for Internet Of ThingsReference architecture for Internet Of Things
Reference architecture for Internet Of Things
 
Beyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at ScaleBeyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at Scale
 
Follow the money with graphs
Follow the money with graphsFollow the money with graphs
Follow the money with graphs
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and Distributed
 
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
 
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLParikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
 
18 Data Streams
18 Data Streams18 Data Streams
18 Data Streams
 
Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future Tense
 
Visualising and Linking Open Data from Multiple Sources
Visualising and Linking Open Data from Multiple SourcesVisualising and Linking Open Data from Multiple Sources
Visualising and Linking Open Data from Multiple Sources
 

Viewers also liked

Accelerating Mobile Development with Mobile Enterprise Application Platforms ...
Accelerating Mobile Development with Mobile Enterprise Application Platforms ...Accelerating Mobile Development with Mobile Enterprise Application Platforms ...
Accelerating Mobile Development with Mobile Enterprise Application Platforms ...Srinath Perera
 
Strata 2014 Talk:Tracking a Soccer Game with Big Data
Strata 2014 Talk:Tracking a Soccer Game with Big DataStrata 2014 Talk:Tracking a Soccer Game with Big Data
Strata 2014 Talk:Tracking a Soccer Game with Big DataSrinath Perera
 
Imagining Supply Chain Processes Outside-in. Building Value Networks at IBM t...
Imagining Supply Chain Processes Outside-in. Building Value Networks at IBM t...Imagining Supply Chain Processes Outside-in. Building Value Networks at IBM t...
Imagining Supply Chain Processes Outside-in. Building Value Networks at IBM t...Lora Cecere
 
Security issues associated with big data in cloud
Security issues associated  with big data in cloudSecurity issues associated  with big data in cloud
Security issues associated with big data in cloudsornalathaNatarajan
 
OpenSource Big Data Platform - Flamingo Project
OpenSource Big Data Platform - Flamingo ProjectOpenSource Big Data Platform - Flamingo Project
OpenSource Big Data Platform - Flamingo ProjectBYOUNG GON KIM
 
Big Data Analytics in Energy & Utilities
Big Data Analytics in Energy & UtilitiesBig Data Analytics in Energy & Utilities
Big Data Analytics in Energy & UtilitiesAnders Quitzau
 
Big Data Platforms: An Overview
Big Data Platforms: An OverviewBig Data Platforms: An Overview
Big Data Platforms: An OverviewC. Scyphers
 

Viewers also liked (10)

Accelerating Mobile Development with Mobile Enterprise Application Platforms ...
Accelerating Mobile Development with Mobile Enterprise Application Platforms ...Accelerating Mobile Development with Mobile Enterprise Application Platforms ...
Accelerating Mobile Development with Mobile Enterprise Application Platforms ...
 
Strata 2014 Talk:Tracking a Soccer Game with Big Data
Strata 2014 Talk:Tracking a Soccer Game with Big DataStrata 2014 Talk:Tracking a Soccer Game with Big Data
Strata 2014 Talk:Tracking a Soccer Game with Big Data
 
SKILLWISE-BIGDATA ANALYSIS
SKILLWISE-BIGDATA ANALYSISSKILLWISE-BIGDATA ANALYSIS
SKILLWISE-BIGDATA ANALYSIS
 
Imagining Supply Chain Processes Outside-in. Building Value Networks at IBM t...
Imagining Supply Chain Processes Outside-in. Building Value Networks at IBM t...Imagining Supply Chain Processes Outside-in. Building Value Networks at IBM t...
Imagining Supply Chain Processes Outside-in. Building Value Networks at IBM t...
 
Security issues associated with big data in cloud
Security issues associated  with big data in cloudSecurity issues associated  with big data in cloud
Security issues associated with big data in cloud
 
Big Data (security Issue)
Big Data (security Issue)Big Data (security Issue)
Big Data (security Issue)
 
Big data security
Big data securityBig data security
Big data security
 
OpenSource Big Data Platform - Flamingo Project
OpenSource Big Data Platform - Flamingo ProjectOpenSource Big Data Platform - Flamingo Project
OpenSource Big Data Platform - Flamingo Project
 
Big Data Analytics in Energy & Utilities
Big Data Analytics in Energy & UtilitiesBig Data Analytics in Energy & Utilities
Big Data Analytics in Energy & Utilities
 
Big Data Platforms: An Overview
Big Data Platforms: An OverviewBig Data Platforms: An Overview
Big Data Platforms: An Overview
 

Similar to Big Data Analysis : Deciphering the haystack

Introduction to Big Data
Introduction to Big Data Introduction to Big Data
Introduction to Big Data Srinath Perera
 
Building your big data solution
Building your big data solution Building your big data solution
Building your big data solution WSO2
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapSrinath Perera
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data StackZubair Nabi
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewAbhishek Roy
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataIMC Institute
 
Large scale computing
Large scale computing Large scale computing
Large scale computing Bhupesh Bansal
 
Big_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic backgroundBig_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic backgroundNidhiAhuja30
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014ALTER WAY
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about SparkGiivee The
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Sparktsliwowicz
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceCsaba Toth
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...Big Data Spain
 

Similar to Big Data Analysis : Deciphering the haystack (20)

Introduction to Big Data
Introduction to Big Data Introduction to Big Data
Introduction to Big Data
 
Building your big data solution
Building your big data solution Building your big data solution
Building your big data solution
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and Roadmap
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data Stack
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Data analytics & its Trends
Data analytics & its TrendsData analytics & its Trends
Data analytics & its Trends
 
Big data.ppt
Big data.pptBig data.ppt
Big data.ppt
 
Lecture1
Lecture1Lecture1
Lecture1
 
Big_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic backgroundBig_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic background
 
unit 1 big data.pptx
unit 1 big data.pptxunit 1 big data.pptx
unit 1 big data.pptx
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
 
Big data
Big dataBig data
Big data
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about Spark
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Spark
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
 

More from Srinath Perera

Book: Software Architecture and Decision-Making
Book: Software Architecture and Decision-MakingBook: Software Architecture and Decision-Making
Book: Software Architecture and Decision-MakingSrinath Perera
 
Data science Applications in the Enterprise
Data science Applications in the EnterpriseData science Applications in the Enterprise
Data science Applications in the EnterpriseSrinath Perera
 
An Introduction to APIs
An Introduction to APIs An Introduction to APIs
An Introduction to APIs Srinath Perera
 
An Introduction to Blockchain for Finance Professionals
An Introduction to Blockchain for Finance ProfessionalsAn Introduction to Blockchain for Finance Professionals
An Introduction to Blockchain for Finance ProfessionalsSrinath Perera
 
AI in the Real World: Challenges, and Risks and how to handle them?
AI in the Real World: Challenges, and Risks and how to handle them?AI in the Real World: Challenges, and Risks and how to handle them?
AI in the Real World: Challenges, and Risks and how to handle them?Srinath Perera
 
Healthcare + AI: Use cases & Challenges
Healthcare + AI: Use cases & ChallengesHealthcare + AI: Use cases & Challenges
Healthcare + AI: Use cases & ChallengesSrinath Perera
 
How would AI shape Future Integrations?
How would AI shape Future Integrations?How would AI shape Future Integrations?
How would AI shape Future Integrations?Srinath Perera
 
The Role of Blockchain in Future Integrations
The Role of Blockchain in Future IntegrationsThe Role of Blockchain in Future Integrations
The Role of Blockchain in Future IntegrationsSrinath Perera
 
Blockchain: Where are we? Where are we going?
Blockchain: Where are we? Where are we going? Blockchain: Where are we? Where are we going?
Blockchain: Where are we? Where are we going? Srinath Perera
 
Few thoughts about Future of Blockchain
Few thoughts about Future of BlockchainFew thoughts about Future of Blockchain
Few thoughts about Future of BlockchainSrinath Perera
 
A Visual Canvas for Judging New Technologies
A Visual Canvas for Judging New TechnologiesA Visual Canvas for Judging New Technologies
A Visual Canvas for Judging New TechnologiesSrinath Perera
 
Privacy in Bigdata Era
Privacy in Bigdata  EraPrivacy in Bigdata  Era
Privacy in Bigdata EraSrinath Perera
 
Blockchain, Impact, Challenges, and Risks
Blockchain, Impact, Challenges, and RisksBlockchain, Impact, Challenges, and Risks
Blockchain, Impact, Challenges, and RisksSrinath Perera
 
Today's Technology and Emerging Technology Landscape
Today's Technology and Emerging Technology LandscapeToday's Technology and Emerging Technology Landscape
Today's Technology and Emerging Technology LandscapeSrinath Perera
 
An Emerging Technologies Timeline
An Emerging Technologies TimelineAn Emerging Technologies Timeline
An Emerging Technologies TimelineSrinath Perera
 
The Rise of Streaming SQL and Evolution of Streaming Applications
The Rise of Streaming SQL and Evolution of Streaming ApplicationsThe Rise of Streaming SQL and Evolution of Streaming Applications
The Rise of Streaming SQL and Evolution of Streaming ApplicationsSrinath Perera
 
Analytics and AI: The Good, the Bad and the Ugly
Analytics and AI: The Good, the Bad and the UglyAnalytics and AI: The Good, the Bad and the Ugly
Analytics and AI: The Good, the Bad and the UglySrinath Perera
 
Transforming a Business Through Analytics
Transforming a Business Through AnalyticsTransforming a Business Through Analytics
Transforming a Business Through AnalyticsSrinath Perera
 
SoC Keynote:The State of the Art in Integration Technology
SoC Keynote:The State of the Art in Integration TechnologySoC Keynote:The State of the Art in Integration Technology
SoC Keynote:The State of the Art in Integration TechnologySrinath Perera
 

More from Srinath Perera (20)

Book: Software Architecture and Decision-Making
Book: Software Architecture and Decision-MakingBook: Software Architecture and Decision-Making
Book: Software Architecture and Decision-Making
 
Data science Applications in the Enterprise
Data science Applications in the EnterpriseData science Applications in the Enterprise
Data science Applications in the Enterprise
 
An Introduction to APIs
An Introduction to APIs An Introduction to APIs
An Introduction to APIs
 
An Introduction to Blockchain for Finance Professionals
An Introduction to Blockchain for Finance ProfessionalsAn Introduction to Blockchain for Finance Professionals
An Introduction to Blockchain for Finance Professionals
 
AI in the Real World: Challenges, and Risks and how to handle them?
AI in the Real World: Challenges, and Risks and how to handle them?AI in the Real World: Challenges, and Risks and how to handle them?
AI in the Real World: Challenges, and Risks and how to handle them?
 
Healthcare + AI: Use cases & Challenges
Healthcare + AI: Use cases & ChallengesHealthcare + AI: Use cases & Challenges
Healthcare + AI: Use cases & Challenges
 
How would AI shape Future Integrations?
How would AI shape Future Integrations?How would AI shape Future Integrations?
How would AI shape Future Integrations?
 
The Role of Blockchain in Future Integrations
The Role of Blockchain in Future IntegrationsThe Role of Blockchain in Future Integrations
The Role of Blockchain in Future Integrations
 
Future of Serverless
Future of ServerlessFuture of Serverless
Future of Serverless
 
Blockchain: Where are we? Where are we going?
Blockchain: Where are we? Where are we going? Blockchain: Where are we? Where are we going?
Blockchain: Where are we? Where are we going?
 
Few thoughts about Future of Blockchain
Few thoughts about Future of BlockchainFew thoughts about Future of Blockchain
Few thoughts about Future of Blockchain
 
A Visual Canvas for Judging New Technologies
A Visual Canvas for Judging New TechnologiesA Visual Canvas for Judging New Technologies
A Visual Canvas for Judging New Technologies
 
Privacy in Bigdata Era
Privacy in Bigdata  EraPrivacy in Bigdata  Era
Privacy in Bigdata Era
 
Blockchain, Impact, Challenges, and Risks
Blockchain, Impact, Challenges, and RisksBlockchain, Impact, Challenges, and Risks
Blockchain, Impact, Challenges, and Risks
 
Today's Technology and Emerging Technology Landscape
Today's Technology and Emerging Technology LandscapeToday's Technology and Emerging Technology Landscape
Today's Technology and Emerging Technology Landscape
 
An Emerging Technologies Timeline
An Emerging Technologies TimelineAn Emerging Technologies Timeline
An Emerging Technologies Timeline
 
The Rise of Streaming SQL and Evolution of Streaming Applications
The Rise of Streaming SQL and Evolution of Streaming ApplicationsThe Rise of Streaming SQL and Evolution of Streaming Applications
The Rise of Streaming SQL and Evolution of Streaming Applications
 
Analytics and AI: The Good, the Bad and the Ugly
Analytics and AI: The Good, the Bad and the UglyAnalytics and AI: The Good, the Bad and the Ugly
Analytics and AI: The Good, the Bad and the Ugly
 
Transforming a Business Through Analytics
Transforming a Business Through AnalyticsTransforming a Business Through Analytics
Transforming a Business Through Analytics
 
SoC Keynote:The State of the Art in Integration Technology
SoC Keynote:The State of the Art in Integration TechnologySoC Keynote:The State of the Art in Integration Technology
SoC Keynote:The State of the Art in Integration Technology
 

Recently uploaded

Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 

Recently uploaded (20)

Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 

Big Data Analysis : Deciphering the haystack

  • 1.
  • 2. Outline • Big Data • Big Data Processing – To Know – To explain – To forecast • Conclusion Photo by John Trainoron Flickr http://www.flickr.com/photos/trainor/2902023575/, Licensed under CC
  • 3. A day in your life  Think about a day in your life? – What is the best road to take? – Would there be any bad weather? – How to invest my money? – How is my health?  There are many decisions that you can do better if only you can access the data and process them. http://www.flickr.com/photos/kcolwell/5 512461652/ CC licence
  • 4. Data Avalanche/ Moore’s law of data • We are now collecting and converting large amount of data to digital forms • 90% of the data in the world today was created within the past two years. • Amount of data we have doubles very fast
  • 5. Internet of Things • Currently physical world and software worlds are detached • Internet of things promises to bridge this – It is about sensors and actuators everywhere – In your fridge, in your blanket, in your chair, in your carpet.. Yes even in your socks – Google IO pressure mats
  • 6. What can we do with Big Data? • Optimize – 1% saving in Airplanes and turbines can save more than 1B$ each year (GE talk, Statra 2014). SL GDP 60B • Save lives – Weather, Disease identification, Personalized treatment • Technology advancement/ Understanding – Most high tech work are done via simulations
  • 8. Why Big Data is hard? • How store? Assuming 1TB bytes it takes 1000 computers to store a 1PB • How to move? Assuming 10Gb network, it takes 2 hours to copy 1TB, or 83 days to copy a 1PB • How to search? Assuming each record is 1KB and one machine can process 1000 records per sec, it needs 277CPU days to process a 1TB and 785 CPU years to process a 1 PB • How to process? – How to convert algorithms to work in large size – How to create new algorithms http://www.susanica.com/photo/9
  • 9. Why it is hard (Contd.)? • System build of many computers • That handles lots of data • Running complex logic • This pushes us to frontier of Distributed Systems and Databases • More data does not mean there is a simple model • Some models can be complex as the system http://www.flickr.com/photos/mariachily/5250487136, Licensed CC
  • 11. Big data Processing Technologies Landscape
  • 12. MapReduce/ Hadoop • First introduced by Google, and used as the processing model for their architecture • Implemented by opensource projects like Apache Hadoop and Spark • Users writes two functions: map and reduce • The framework handles the details like distributed processing, fault tolerance, load balancing etc. • Widely used, and the one of the catalyst of Big data void map(ctx, k, v){ tokens = v.split(); for t in tokens ctx.emit(t,1) } void reduce(ctx, k, values[]){ count = 0; for v in values count = count + v; ctx.emit(k,count); }
  • 15. Histogram of Speeds(Contd.) void map(ctx, k, v){ event = parse(v); range = calcuateRange(event.v); ctx.emit(range,1) } void reduce(ctx, k, values[]){ count = 0; for v in values count = count + v; ctx.emit(k,count); }
  • 16. Hadoop Landscape • HIVE - Query data using SQL style queries, and Hive will convert them to MapReduce jobs and run in Hadoop. • Pig - We write programs using data flow style scripts, and Pig convert them to MapReduce jobs and run in Hadoop. • Mahout – Collection of MapReduce jobs implementing many Data mining and Artificial Intelligence algorithms using MapReduce A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int); B = FILTER A BY gni > 2000; C = ORDER B BY gni; dump C; hive> SELECT country, gni from HDI WHERE gni > 2000;
  • 17. Is Hadoop Enough? • Limitations – Takes time for processing – Lack of Incremental processing – Weak with Graph usecases – Not very easy to create a processing pipeline (addressed with HIVE, Pig etc.) – Too close to programmers – Faster implementations are possible • Alternatives – Apache Drill http://incubator.apache.org/drill/ – Spark http://spark-project.org/ – Graph Processors like http://giraph.apache.org/
  • 18. Spark • New Programming Model built on functional programming concepts • Can be much faster for recursive usecases • Have a complete stack of products file = spark.textFile("hdfs://...”) file.flatMap( line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _)
  • 19. Real-time Analytics • Idea is to process data as they are received in streaming fashion • Used when we need – Very fast output – Lots of events (few 100k to millions) – Processing without storing (e.g. too much data) • Two main technologies – Stream Processing (e.g. Strom, http://storm-project.net/ ) – Complex Event Processing (CEP) http://wso2.com/products/complex- event-processor/
  • 20. Complex Event Processing (CEP) • Sees inputs as Event streams and queried with SQL like language • Supports Filters, Windows, Join, Patterns and Sequences from p=PINChangeEvents#win.time(3600) join t=TransactionEvents[p.custid=custid][amount>1000 0] #win.time(3600) return t.custid, t.amount;
  • 21. DEBS Challenge • Event Processing challenge • Real football game, sensors in player shoes + ball • Events in 15k Hz • Event format – Sensor ID, TS, x, y, z, v, a • Queries – Running Stats – Ball Possession – Heat Map of Activity – Shots at Goal
  • 22. Example: Detect ball Possession • Possession is time a player hit the ball until someone else hits it or it goes out of the ground from Ball#window.length(1) as b join Players#window.length(1) as p unidirectional on debs: getDistance(b.x,b.y,b.z, p.x, p.y, p.z) < 1000 and b.a > 55 select ... insert into hitStream from old = hitStream , b = hitStream [old. pid != pid ], n= hitStream[b.pid == pid]*, ( e1 = hitStream[b.pid != pid ] or e2= ballLeavingHitStream) select ... insert into BallPossessionStream http://www.flickr.com/photos/glennharper/146164820/
  • 23. Information Visualization • Presenting information – To end user – To decision takers – To scientist • Interactive exploration • Sending alerts http://www.flickr.com/photos/stevefae embra/3604686097/
  • 25. Show Han Rosling’s Data • This use a tool called “Gapminder”
  • 26. GNU Plot set terminal png set output "buyfreq.png" set title "Frequency Distribution of Items brought by Buyer"; setylabel "Number of Items Brought"; setxlabel "Buyers Sorted by Items count"; set key left top set log y set log x plot "1.data" using 2 title "Frequency" with linespoints • Open source • Very powerful
  • 27. Web Visualization Libraries • D3, Flot etc. • Could generate compelling graphs
  • 29. Making Sense of Data • To know (what happened?) – Basic analytics + visualizations (min, max, average, histogram, distributions … ) – Interactive drill down • To explain (why) – Data mining, classifications, building models, clustering • To forecast – Neural networks, decision models
  • 30. To know (what happened/ happening?) • Analytics Implemented with MapReduce or Queries – Min, Max, average, correlation, histograms – Might join group data in many ways • Data is often presented with some visualizations • Support for drill down • Real-time analytics to track things fast enough and Alerts. http://www.flickr.com/photos/isriya/2967310 333/
  • 31. Analytics/ Summary • Min, Max, Mean, Standard deviation, correlation, grouping etc. • E.g. Internet Access logs – Number of hits, total, by day or hour, trend over time – File size distribution – Request Success to failure ratio – Hits by client IP region http://pixabay.com, by Steven Iodice
  • 32. Drill Down • Idea is to let users drill down and explore a view of the data (called BI or OLAP) • Users go and define 3 dimensions (or more), and tool let users explore the cube and only look at subset of data. • E.g. tool: Mondrian http://pixabay.com, by Steven Iodice
  • 33. Search • Process and Index the data. The killer app of our time. http://pixabay.com, by Steven Iodice • Web Search • Graph Search • Semantic Search • Drug Discovery • …
  • 34. Search: New Ideas • Apache Drill, Big Query • Blink: search via sampling
  • 35. To Explain (Patterns) • Understanding underline patterns from data • Correlation – Scatter plot, Regression • Data Mining (Detecting Patterns) – Clustering – Dimension Reduction – Finding Similar items – Finding Hubs and authorities in a Graph – Finding frequent item sets – Making recommendation http://www.flickr.com/photos/eriwst/2987739376/ and http://www.flickr.com/photos/focx/5035444779/
  • 36. Usecase 1: Clustering • Cluster by location • Cluster by similarity • Cluster to identify clicks (e.g. fast information distribution, forensic) • E.g. Earth Quake Data
  • 37. Usecase 2: Regression • Find the relationship between variables • E.g. find the trend • E.g. relationship between money spent and Education level improvement
  • 38. Tools • Weka – java machine learning library • Mahout – MapReduce implementation of Machine learning algorithms • R – programming language for statistical computing • PMML (Predictive model markup language) • Scaling these libraries is a challenge
  • 39. Forecasts and Models • Trying to build a model for the data • Theoretically or empirically – Analytical models (e.g. Physics, but PDEs kill you) – Simulations (Numerical methods) – Regression Models – Time Series Models – Neural network models – Reinforcement learning – Classifiers – dimensionality reduction, kernel methods) • Examples – Translation – Weather Forecast models – Building profiles of users – Traffic models – Economic models http://misterbijou.blogspot.com/201 0_09_01_archive.html
  • 40. Usecase 1: Modeling Solar System • PDEs not solvable • Simulations • Other examples: Weather
  • 41. Usecase 2: Electricity Demand Forecast • Time Series • Moving averages • Cycles
  • 42. Usecase 3: Targeted Marketing • Collect data and build a model on – What user like – What he has brought – What is his buying power • Making recommendations • Giving personalized deals
  • 43. Big data lifecycle • Realizing the big data lifecycle is hard • Need wide understanding about many fields • Big data teams will include members from many fields working together
  • 44. • Data Scientists: A new role, evolving Job description • His Job is to make sense of data, by using Big data processing, advance algorithms, Statistical methods, and data mining etc. • Likely to be a highest paid/ cool Jobs. • Big organizations (Google, Facebook, Banks) hiring Math PhD by a lot for this. Data Scientist Statisticians System experts Domain Experts DB experts AI experts
  • 45. Challenges • Fast scalable real-time analytics • Merging real-time and batch processing • Good libraries for machine learning • Scaling decision models e.g. machine learning
  • 46. Conclusions • Big data? How, Why and Bigdata architecture • Data Processing Tools – MapReduce, CEP, visualizations • Solving problems – To know what happened – To explain what happened – To predict what will happen • Conclusions