BIG DATA SCIENCE
“The price of light is far less than the cost of darkness”
Chandan Rajah [ @ChandanRajah ]
BENEFITS OF BIG DATA
COST SPEED
AGILITY CAPABILITY
Steps to the EPIPHANY
WHERE
WHAT WHY
DEMO
What is Big Data ?
Big Data ≠ Data Volume
Big Data = Crude Oil
Think of data like ‘Crude Oil’
Big Data is about extracting ‘crude oil’; transporting it in ‘pipelines’; storing it in ‘mega tanks’
What is Data Science ?
Data Science ≠ Statistical Analysis
Data Science = Oil Refinery
Data science is about ‘treating’ data; applying ‘science’ to the data;
Refine the data ‘results’; and combine to form ‘insight’
Knowns, Unknowns & DIKUW FTW!
known knowns
we know we know
known unknowns
we know we don’t know
unknown unknowns
we don’t know we don’t know
D
DATA
I
INFORMATION
K
KNOWLEDGE
W
WISDOM
U
UNDERSTANDING
raw what how to why when
numbers description experience cause & effect prediction
letters context tested proven what’s best
symbols relationship instruction
signals reports programs models
PAST FUTURE
Data Engineer Data Analyst Data Miner Data Scientist
known knowns
known unknowns unknown unknowns
Data Analytics to Data Discovery ?
data you know
data you don’t know
questions you’re asking
questions you’re not asking
Data Analyst
Data Scientist
Data Analytics
Data Discovery
DATA MODELLING
Y = F(X, random noise, parameters)
ALGORITHMIC MODELLING
Y ← [ BLACK BOX ] ← X
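The contrast between the two modelling approaches above can be sketched in a few lines. This is an illustration only, on a hypothetical toy dataset: data modelling assumes a form for F (here a straight line fitted by least squares), while algorithmic modelling treats the path from X to Y as a black box (here a nearest-neighbour average, picked purely as an example). The names `fit_line` and `knn_predict` are not from the deck.

```python
from statistics import mean

# Hypothetical toy data: y is roughly 2*x + 1 (values invented for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8, 11.0]


def fit_line(xs, ys):
    """Data modelling: assume Y = a*X + b + noise and estimate the parameters a, b."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


def knn_predict(xs, ys, x_new, k=2):
    """Algorithmic modelling: no assumed form; average the k nearest observed outputs."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return mean(y for _, y in nearest)


a, b = fit_line(xs, ys)
line_prediction = a * 2.5 + b
black_box_prediction = knn_predict(xs, ys, 2.5)

print(f"parametric: a={a:.2f}, b={b:.2f}, y(2.5)={line_prediction:.2f}")
print(f"black box:  y(2.5)={black_box_prediction:.2f}")
```

The parametric fit yields interpretable parameters; the black box yields only predictions, which is the trade-off the slide points at.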
DIVIDE
SCATTER
Split Data into Blocks
Replicate and Store
Petabytes of Resilience
CONQUER
EXPLORE
1000s of Parallel Threads
Explore Every Path
Machine Learning
INSIGHT
GATHER
Real Time Action
Periodic Dashboards
Iterative Evolution
What is the Big Idea ?
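A minimal sketch of the divide, conquer, gather idea above, assuming hypothetical data and a thread pool standing in for a cluster. The function names `scatter`, `explore` and `gather` simply mirror the slide labels; a real system would also replicate and store the blocks.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(records, block_size):
    """DIVIDE: split the data into fixed-size blocks."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def explore(block):
    """CONQUER: work done independently on one block (a partial sum and count)."""
    return sum(block), len(block)


def gather(partials):
    """INSIGHT: combine the partial results into one answer (the overall mean)."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


records = list(range(1, 101))  # hypothetical data: the integers 1..100
blocks = scatter(records, block_size=16)

# Each block is processed in parallel; no block needs to see any other block.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(explore, blocks))

overall_mean = gather(partials)
print(len(blocks), overall_mean)
```

The key property is that `explore` needs only its own block, so adding more blocks and more workers scales the same program.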
Divide = HDFS
WRITE
Client → Name Node: 1. Create Metadata
Client → Data Nodes: 2. Put Blocks
Name Node ↔ Data Nodes: Control / Monitoring
[Diagram: numbered blocks, each replicated across several Data Nodes]
READ
Client → Name Node: 1. Get Metadata
Client → Data Nodes: 2. Fetch Blocks
Name Node ↔ Data Nodes: Control / Monitoring
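The write and read paths above can be mimicked with a toy in-memory model. This is not the HDFS API: `ToyCluster`, its round-robin placement rule and the tiny block size are all assumptions made for illustration. It shows only the two-step protocol (metadata first, blocks second) and why replication lets a read survive a failed data node.

```python
import itertools


class ToyCluster:
    """In-memory stand-in for a name node plus data nodes (hypothetical, not HDFS)."""

    def __init__(self, num_data_nodes=4, replication=3, block_size=8):
        if replication > num_data_nodes:
            raise ValueError("replication cannot exceed the number of data nodes")
        self.metadata = {}  # name node: file name -> [(block_id, [node indexes])]
        self.data_nodes = [{} for _ in range(num_data_nodes)]  # block_id -> bytes
        self.replication = replication
        self.block_size = block_size
        self._ids = itertools.count(1)

    def write(self, name, data):
        entries = []
        for start in range(0, len(data), self.block_size):
            block_id = next(self._ids)
            # Assumed placement rule: consecutive nodes, wrapping around.
            nodes = [(block_id + r) % len(self.data_nodes)
                     for r in range(self.replication)]
            entries.append((block_id, nodes))          # 1. create metadata
            for node in nodes:                         # 2. put blocks
                self.data_nodes[node][block_id] = data[start:start + self.block_size]
        self.metadata[name] = entries

    def read(self, name, failed_nodes=()):
        chunks = []
        for block_id, nodes in self.metadata[name]:   # 1. get metadata
            live = [n for n in nodes if n not in failed_nodes]
            if not live:
                raise IOError(f"block {block_id} has no live replica")
            chunks.append(self.data_nodes[live[0]][block_id])  # 2. fetch blocks
        return b"".join(chunks)


cluster = ToyCluster()
payload = b"the price of light is far less than the cost of darkness"
cluster.write("quote.txt", payload)
print(cluster.read("quote.txt") == payload)
print(cluster.read("quote.txt", failed_nodes={0}) == payload)
```

With three replicas per block, losing any single node still leaves at least two copies of every block, so the second read succeeds.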
Conquer = MapReduce
Insight = Functional Paradigm
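As a rough illustration of how MapReduce leans on the functional paradigm, here is a single-process word count. The phase names `map_phase`, `shuffle` and `reduce_phase` are labels chosen for this sketch, and the input lines are invented; a real framework would run the map and reduce steps across many machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: turn one input line into (key, value) pairs, with no shared state."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Reduce: fold the values for one key into a single result."""
    return reduce(lambda acc, v: acc + v, values, 0)


lines = ["big data is crude oil", "data science is the oil refinery"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = {word: reduce_phase(values) for word, values in shuffle(pairs).items()}
print(counts)
```

Because `map_phase` and `reduce_phase` are pure functions, they can be applied to any block or key in any order, which is what makes the work easy to parallelise.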
Steps to the EPIPHANY
WHERE
WHAT WHY
DEMO
Why is Big Data needed ?
VOLUME VELOCITY VARIETY
Exponential growth; 2x in 2 yrs
PB (1000 TB) is now common
Event streams; never at rest
640k GB per internet minute
100s of data sources
85% not in a table
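The volume and velocity figures above can be turned into back-of-the-envelope numbers. The sketch assumes decimal units (1 TB = 1000 GB, matching the slide's 1 PB = 1000 TB) and a steady doubling every two years; both are simplifications, and `growth_factor` is a helper named here for illustration.

```python
GB_PER_TB = 1000  # assumption: decimal units
TB_PER_PB = 1000  # from the slide: PB (1000 TB)

# Velocity: 640k GB per internet minute, scaled up to one day.
gb_per_minute = 640_000
gb_per_day = gb_per_minute * 60 * 24
pb_per_day = gb_per_day / (GB_PER_TB * TB_PER_PB)


def growth_factor(years, doubling_period_years=2):
    """Volume: how many times larger the data is after `years` of steady doubling."""
    return 2 ** (years / doubling_period_years)


print(f"{gb_per_day:,} GB per day = {pb_per_day} PB per day")
print(f"growth after 1 year:   x{growth_factor(1):.2f}")
print(f"growth after 10 years: x{growth_factor(10):.0f}")
```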
Where in the Value Chain ?
Generation Transport Knowledge Output Value
BIG DATA SCIENCE
Straddles all four Challenge Areas
Steps to the EPIPHANY
WHERE
WHAT WHY
DEMO
Big Data Heat Map – Gartner 2012
Big Data Potential by Sector – McKinsey for USBLS, 2011
Big Data Investment by Industry – Gartner, 2012
Top Big Data Challenges – Gartner, 2012
Survey on Big Data Investments – IDG Survey, 2013
Survey on Main Drivers to Invest – IDG Survey, 2014
Steps to the EPIPHANY
WHERE
WHAT WHY
DEMO
DEMO
RECAP OF BENEFITS
COST SPEED
AGILITY CAPABILITY
LAST WORDS OF WISDOM
NOT ALL ROADS LEAD TO ROME
TIME VALUE OF DATA KNOWLEDGE IS POWER
I AM AN INDIVIDUAL
“The price of light is far less than the cost of darkness”

Steps to the Big Data Science Epiphany

Editor's Notes

  • #3 COST – 20x less per TB vs. Teradata, Netezza, Oracle; 75% less average marginal cost per capacity. SPEED – 10x faster than Teradata, Netezza. AGILITY – 115% lower average cost per data source vs. Oracle. SCIENCE – Machine learning, prediction.
  • #4 WHAT - What is Big Data Science? WHY - Why is it needed? WHERE - Where is it being used? HOW - How will it evolve?
  • #13 WHAT - What is Big Data Science? WHY - Why is it needed? WHERE - Where is it being used? HOW - How will it evolve?
  • #16 WHAT - What is Big Data Science? WHY - Why is it needed? WHERE - Where is it being used? HOW - How will it evolve?
  • #23 WHAT - What is Big Data Science? WHY - Why is it needed? WHERE - Where is it being used? HOW - How will it evolve?
  • #25 COST – 20x less per TB vs. Teradata, Netezza, Oracle; 75% less average marginal cost per capacity. SPEED – 10x faster than Teradata, Netezza. AGILITY – 115% lower average cost per data source vs. Oracle. SCIENCE – Machine learning, prediction.
  • #26 TIME VALUE - Yesterday’s data is less valuable than today’s data, but historical data is more valuable than the present moment alone. POWER - Getting from unknown unknowns to known unknowns or known knowns is powerful. LEAD TO ROME - Exploring with no direct business impact is not a bad thing. INDIVIDUAL - Treat every customer as an individual, not an aggregate: analyse individuals first, then aggregate only the individual insights.
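The last note, on analysing individuals before aggregating, can be illustrated with a small sketch. The customers and spend figures are invented, and `trend` is a helper defined here only for the example: aggregating first hides two opposite behaviours that per-customer analysis reveals.

```python
from statistics import mean

# Hypothetical monthly spend per customer (values invented for illustration).
spend = {
    "customer_a": [10, 20, 30, 40],  # spending more each month
    "customer_b": [40, 30, 20, 10],  # spending less each month
}


def trend(series):
    """Average month-on-month change in a series."""
    return mean(later - earlier for earlier, later in zip(series, series[1:]))


# Aggregate first: total spend per month looks perfectly flat.
aggregate = [sum(month) for month in zip(*spend.values())]
aggregate_trend = trend(aggregate)

# Individuals first: one customer is growing, the other is declining.
individual_trends = {name: trend(series) for name, series in spend.items()}

# Only now aggregate, and aggregate the insights rather than the raw data.
segments = {
    "growing": [n for n, t in individual_trends.items() if t > 0],
    "declining": [n for n, t in individual_trends.items() if t < 0],
}

print(aggregate_trend, individual_trends, segments)
```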