E.g. Targeted Marketing
• Assume mass emails to 1M people, a reaction rate of 1%, and a cost of $2 per email.
– Then the cost is $2M and the reach is 10k people.
• Let's say that by looking at demographics (e.g. where they live), you can find (e.g. by using decision trees) 250K people with a reaction rate of 6%.
– Then the cost is $500K and the reach is 15k people.
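The arithmetic on this slide can be checked with a short Python sketch. The `Campaign` class and its field names are made up for illustration; the numbers are the ones from the slide. The cost per reaction is a derived figure, not one stated on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Campaign:
    """Hypothetical helper: one email campaign, using the slide's figures."""
    recipients: int        # number of people emailed
    reaction_percent: int  # percentage of recipients who react
    cost_per_email: int    # cost of one email, in dollars

    def cost(self) -> int:
        return self.recipients * self.cost_per_email

    def reach(self) -> int:
        # Integer arithmetic keeps the result exact.
        return self.recipients * self.reaction_percent // 100

    def cost_per_reaction(self) -> float:
        return self.cost() / self.reach()


mass = Campaign(recipients=1_000_000, reaction_percent=1, cost_per_email=2)
targeted = Campaign(recipients=250_000, reaction_percent=6, cost_per_email=2)

for name, campaign in (("mass", mass), ("targeted", targeted)):
    print(f"{name}: cost=${campaign.cost():,} reach={campaign.reach():,} "
          f"cost per reaction=${campaign.cost_per_reaction():.2f}")
```

Under these assumptions the targeted campaign costs a quarter as much and still reaches 1.5 times as many people.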
A day in your life
• Think about a day in your life.
– What is the best road to take?
– Will there be any bad weather?
– How should I invest my money?
– How is my health?
• There are many decisions that you could make better if only you could access the data and process it.
http://www.flickr.com/photos/kcolwell/5512461652/ (CC licence)
Internet of Things
• Currently the physical world and the software world are detached
• The Internet of Things promises to bridge this gap
– It is about sensors and actuators everywhere
– In your fridge, in your blanket, in your chair, in your carpet. Yes, even in your socks
– Umbrellas that light up when there is rain, and medicine cups
What can we do with Big Data?
• Optimize (the world is inefficient)
– 30% of food is wasted from farm to plate
– GE 1% initiative (http://goo.gl/eYC0QE )
• A 1% saving in trains can save 2B/year
• A 1% saving in US healthcare is 20B/year
• For comparison, Sri Lanka's total exports are 9B/year.
• Save lives
– Weather, disease identification, personalized treatment
• Technology advancement
– Most high-tech research is done via simulations
Big Data Architecture
Big Data Processing Technologies Landscape
Batch Processing
• Store and process
• Slow (> 5 minutes for results for a reasonable use case)
• Programming model is MapReduce
– Apache Hadoop
– Spark
• Lots of tools are built on top
– Hive, Shark (SQL-style queries), Mahout (ML), Giraph (graph processing)
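To make the MapReduce programming model concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop or Spark; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are chosen for illustration only, and word counting is simply the customary first example.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

In a real cluster the map and reduce calls run in parallel on many machines and the shuffle moves data over the network; the sketch only shows the shape of the three steps.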
Real-time Analytics
• The idea is to process data as it is received, in streaming fashion (without storing it)
• Used when we need
– Very fast output (milliseconds)
– Lots of events (a few 100k to millions)
• Two main technologies
– Stream Processing (e.g. Apache Storm, http://storm-project.net/ )
– Complex Event Processing (CEP), e.g. WSO2 CEP
define partition "playerPartition" as PlayerDataStream.pid;
from PlayerDataStream#win.time(1m)
select pid, avg(speed) as avgSpeed
insert into AvgSpeedStream
using partition playerPartition;
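One reading of the query above is: for each player id, keep the events of the last minute and emit the average speed. The Python sketch below illustrates that idea only. It is not how WSO2 CEP is implemented, the class name `PlayerSpeedWindow` is hypothetical, and it assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque


class PlayerSpeedWindow:
    """Per-player sliding time window that averages speed (illustrative only)."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window_seconds = window_seconds
        # One queue of (timestamp, speed) events per player id (the partition key).
        self._events = defaultdict(deque)

    def on_event(self, pid: str, speed: float, timestamp: float) -> float:
        """Add one event and return the player's average speed over the window."""
        events = self._events[pid]
        events.append((timestamp, speed))
        # Drop events that have fallen out of the time window.
        while events[0][0] <= timestamp - self.window_seconds:
            events.popleft()
        return sum(s for _, s in events) / len(events)


window = PlayerSpeedWindow(window_seconds=60.0)
print(window.on_event("p1", 10.0, timestamp=0))   # only event for p1
print(window.on_event("p1", 20.0, timestamp=30))  # two events in the window
print(window.on_event("p2", 5.0, timestamp=40))   # separate partition
print(window.on_event("p1", 30.0, timestamp=70))  # the event at t=0 has expired
```

Each event is processed as it arrives and only the current window is held in memory, which is the "without storing" property the slide refers to.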
Curious Case of Missing Data
• WW II: returned aircraft, and data on where they were hit
• Where would you add armour?
http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, picture from http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/
Big data lifecycle
• Get the data and clean it up
Making Sense of Data
• Hindsight (to know what happened)
– Basic analytics + visualizations (min, max, average, histogram, distributions …)
• Oversight (to know what is happening and fix it)
– Real-time analytics
• Insight
– Pattern mining, clustering
• Foresight
– Neural networks, classification, recommendation
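As a small illustration of the "hindsight" level, the sketch below computes the basic analytics listed above (min, max, average, histogram) with the Python standard library. The `summarize` function and the sample readings are made up for this example.

```python
from collections import Counter
from statistics import mean


def summarize(values: list[int], bin_width: int) -> dict:
    """Return min, max, average and a fixed-width histogram of the values."""
    # Each value is assigned to the bin that starts at the nearest lower multiple of bin_width.
    histogram = Counter((value // bin_width) * bin_width for value in values)
    return {
        "min": min(values),
        "max": max(values),
        "average": mean(values),
        "histogram": dict(sorted(histogram.items())),
    }


readings = [3, 7, 8, 12, 15, 15, 21, 22, 32]  # made-up sample data
summary = summarize(readings, bin_width=10)
print(summary)
```

The histogram keys are the lower bounds of each bin, so `{0: 3, 10: 3, 20: 2, 30: 1}` means three readings fell in 0 to 9, three in 10 to 19, and so on.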
Use case: Planning
• Urban Planning
– People distribution
– Mobility
– Waste Management
– E.g. see http://goo.gl/jPujmM
• Market Research
– Buying Patterns
– Sentiments
Use case: Predictive Maintenance
• The idea is to fix the problem before it breaks, avoiding expensive downtime
– Airplanes, turbines, windmills
– Construction equipment
– Cars, golf carts
• How?
– Anomaly detection (deviations from normal operation)
– Matching against known error patterns
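The anomaly detection idea can be sketched with a simple z-score rule: learn the mean and standard deviation from readings taken during normal operation, then flag new readings that deviate by more than a chosen number of standard deviations. This is only one of many possible techniques and not necessarily the one the talk had in mind; the function name, threshold, and sensor values below are assumptions made for illustration.

```python
from statistics import mean, stdev


def find_anomalies(baseline: list[float], readings: list[float],
                   threshold: float = 3.0) -> list[tuple[int, float]]:
    """Return (index, value) for readings far from the baseline's normal range."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation of normal operation
    return [(i, x) for i, x in enumerate(readings)
            if abs(x - mu) > threshold * sigma]


# Made-up sensor values: baseline from normal operation, then new readings.
baseline = [70.1, 69.8, 70.3, 70.0, 69.9, 70.2, 70.0, 69.7]
new_readings = [70.1, 69.9, 75.4, 70.2]

anomalies = find_anomalies(baseline, new_readings)
print(anomalies)
```

Here the third reading is the only one flagged. A production system would also need to handle drifting baselines and correlated sensors, which this sketch ignores.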
Outline
SLASSCOM TECH TALKS 
https://www.facebook.com/SlasscomTechnologyForum 
http://www.slasscom.lk/events 
https://twitter.com/slasscom 
www.slideshare.net/slasscomtechforum

Introduction to Data Processing (by Srinath Perera)
