Hadoop meets Mature BI: Data Scientists

@Kognitio @mphnyc #MPP_R #OANYC
Hadoop meets Mature BI:
Where the rubber meets the road for
Data Scientists
Michael Hiskey
Futurist + Product Evangelist
VP, Marketing & Business Development
Kognitio

The Data Scientist
Sexiest job of the 21st Century?

Key Concept: Graduation
Projects will need
to Graduate
from the
Data Science Lab
and become part
of
Business as Usual

Demand for the Data Scientist
Organizational appetite for tens, not hundreds
© EMC Corporation and The Guardian UK™ http://www.guardian.co.uk/news/datablog/2012/mar/02/data-scientist#zoomed-picture

Don’t be a Railroad Stoker!
Highly skilled engineering required …
but the world innovated around them.

Business Intelligence
Numbers
Tables
Charts
Indicators
Time
‐ History
‐ Lag
Access
‐ to view (portal)
‐ to data
‐ to depth
‐ Control/Secure
Consumption
‐ digestion
…with ease and simplicity
Straddle IT and Business
Faster
Lower latency
More granularity
Richer data model
Self service

What has changed?
More connected users?
More-connected users?

According to one estimate, mankind created 150 exabytes
(billion gigabytes) of data in 2005.
In 2010 this was 1,200 exabytes.
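To put the two figures side by side, here is a minimal sketch of the arithmetic. It uses only the two numbers quoted on the slide; the assumption of smooth compound growth between 2005 and 2010 is ours, not the source estimate's.

```python
# Figures quoted on the slide (exabytes of data created per year).
EXABYTES_2005 = 150
EXABYTES_2010 = 1200
YEARS = 2010 - 2005

# Overall growth multiple across the five-year span.
growth_multiple = EXABYTES_2010 / EXABYTES_2005

# Implied compound annual growth rate. Assumption: growth was smooth
# year on year, which the slide itself does not claim.
annual_growth = growth_multiple ** (1 / YEARS) - 1

print(f"Growth multiple: {growth_multiple:.0f}x")
print(f"Implied annual growth: {annual_growth:.1%}")
```

Under that assumption, an eightfold increase in five years works out to roughly 50% more data every year.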

Data Variety

Respondents were asked to choose up to two descriptions about how their organizations view big data from the choices above. Choices have been abbreviated, and selections have been normalized to equal 100%. n=1144
Source: IBM Institute for Business Value/Said Business School Survey
What?
New value comes from your existing data

Hadoop ticks many but not all the boxes

• No need to pre‐process
• No need to align to schema
• No need to triage
Null storage concerns

The drive for deeper understanding
Chart axes: Technology/Automation, Analytical Complexity
Machine learning algorithms
Dynamic Simulation
Statistical Analysis
Clustering
Behaviour modelling
Reporting & BPM
Fraud detection
Dynamic Interaction
Campaign Management

Hadoop just too slow for interactive BI!
…loss of train‐of‐thought
“while Hadoop shines as a processing platform, it is painfully slow as a query tool”

Analytics needs
low latency, no I/O wait
High speed in‐memory processing

Analytical Platform: Reference Architecture
Application & Client Layer: All BI Tools, All OLAP Clients, Excel
Reporting
Analytical Platform Layer
Near‐line Storage (optional)
Persistence Layer: Hadoop Clusters, Enterprise Data Warehouses, Legacy Systems, Cloud Storage, …

The Future
Big Data
Advanced Analytics
In-memory
Logical Data Warehouse
Predictive Analytics
Data Scientists

connect
www.kognitio.com
twitter.com/kognitio
linkedin.com/companies/kognitio
tinyurl.com/kognitio
youtube.com/kognitio
NA: +1 855 KOGNITIO
EMEA: +44 1344 300 770
THESE SLIDES: www.slideshare.net/Kognitio

Hadoop meets Mature BI:
Where the rubber meets the road for
Data Scientists
• The key challenge for Data Scientists is not the proliferation of their
roles, but the ability to ‘graduate’ key Big Data projects from the
‘Data Science Lab’ and productionize them across their broader
organizations.
• Over the next 18 months, "Big Data" will become just "Data"; this
means everyone (even business users) will need a way to use it,
without reinventing the way they interact with their current
reporting and analysis.
• Doing this requires interactive analysis with existing tools and
massively parallel code execution, tightly integrated with Hadoop.
• Your Data Warehouse is dying; Hadoop will elicit a material shift
away from price per TB in persistent data storage.

The new bounty hunters:
Drill
Impala
Pivotal
Stinger
The No SQL Posse
WANTED
DEAD OR ALIVE
SQL

It’s all about getting work done
Tasks evolving:
Used to be a simple fetch of a value
Then it was calculating a dynamic aggregate
Now complex algorithms!
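The first two stages of that evolution can be sketched as follows. This is an illustration only: it uses Python's built-in sqlite3 module rather than Kognitio, the table names are borrowed from the queries in this deck, and the columns, rows, and ISO date strings are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema and rows, loosely named after the tables in the deck.
cur.execute("CREATE TABLE sales_fact (dept TEXT, period TEXT, sales REAL)")
cur.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [
        ("toys", "2006-05-03", 30000.0),
        ("toys", "2006-05-17", 25000.0),
        ("books", "2006-05-09", 20000.0),
        ("books", "2006-06-01", 90000.0),  # outside May, so filtered out
    ],
)
cur.execute("CREATE TABLE summary (year INTEGER, month INTEGER, total_sales REAL)")
cur.execute("INSERT INTO summary VALUES (2006, 5, 75000.0)")

# Stage 1: simple fetch of a value that was precomputed elsewhere.
fetched = cur.execute(
    "SELECT total_sales FROM summary WHERE year = 2006 AND month = 5"
).fetchone()[0]

# Stage 2: dynamic aggregate, computed at query time from the detail rows.
by_dept = cur.execute(
    """
    SELECT dept, SUM(sales)
    FROM sales_fact
    WHERE period BETWEEN '2006-05-01' AND '2006-05-31'
    GROUP BY dept
    HAVING SUM(sales) > 50000
    """
).fetchall()

print(fetched)
print(by_dept)
conn.close()
```

The third stage, complex algorithms, is where a plain SQL engine runs out of road and code has to execute next to the data.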

create external script LM_PRODUCT_FORECAST environment rsint
receives ( SALEDATE DATE, DOW INTEGER, ROW_ID INTEGER, PRODNO INTEGER, DAILYSALES
partition by PRODNO order by PRODNO, ROW_ID
sends ( R_OUTPUT varchar )
isolate partitions
script S'endofr( # Simple R script to run a linear fit on daily sales
prod1<-read.csv(file=file("stdin"), header=FALSE,row.names
colnames(prod1)<-c("DOW","ID","PRODNO","DAILYSALES")
dim1<-dim(prod1)
daily1<-aggregate(prod1$DAILYSALES, list(DOW = prod1$DOW),
daily1[,2]<-daily1[,2]/sum(daily1[,2])
basesales<-array(0,c(dim1[1],2))
basesales[,1]<-prod1$ID
basesales[,2]<-(prod1$DAILYSALES/daily1[prod1$DOW+1,2])
colnames(basesales)<-c("ID","BASESALES")
fit1=lm(BASESALES ~ ID,as.data.frame(basesales))
select Trans_Year, Num_Trans,
count(distinct Account_ID) Num_Accts,
sum(count( distinct Account_ID)) over (partition by Trans_Year
cast(sum(total_spend)/1000 as int) Total_Spend,
cast(sum(total_spend)/1000 as int) / count(distinct Account_ID
rank() over (partition by Trans_Year order by count(distinct A
rank() over (partition by Trans_Year order by sum(total_spend)
from( select Account_ID,
Extract(Year from Effective_Date) Trans_Year,
count(Transaction_ID) Num_Trans,
sum(Transaction_Amount) Total_Spend,
select dept, sum(sales)
from sales_fact
where period between date '01-05-2006' and date '31-05-2006'
group by dept
having sum(sales) > 50000;
select sum(sales)
from sales_history
where year = 2006 and month = 5 and region=1;
select total_sales
from summary
where year = 2006 and month = 5 and region=1;
Behind the
numbers
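The R script at the top of this slide is cut off at the right edge, so here is a rough Python sketch of what its visible lines appear to do for each PRODNO partition: build a day-of-week sales profile, divide it out, and fit a line to the adjusted sales. The function names and sample rows are hypothetical, and summing sales per day of week is an assumption, because the aggregate function in the original is truncated.

```python
from collections import defaultdict


def dow_shares(rows):
    """Share of total sales that falls on each day of week."""
    totals = defaultdict(float)
    for dow, _row_id, sales in rows:
        totals[dow] += sales  # assumption: the truncated R aggregate sums
    grand_total = sum(totals.values())
    return {dow: total / grand_total for dow, total in totals.items()}


def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_product(rows):
    """Divide out the day-of-week profile, then fit base sales against row id."""
    shares = dow_shares(rows)
    ids = [row_id for _dow, row_id, _sales in rows]
    base = [sales / shares[dow] for dow, _row_id, sales in rows]
    return linear_fit(ids, base)


# Hypothetical input rows: (prodno, dow, row_id, daily_sales).
data = [
    (1, 0, 1, 10.0), (1, 1, 2, 30.0), (1, 0, 3, 10.0), (1, 1, 4, 30.0),
    (2, 0, 1, 5.0), (2, 1, 2, 8.0), (2, 0, 3, 7.0), (2, 1, 4, 12.0),
]

# Group by product, mirroring "partition by PRODNO" in the external script.
partitions = defaultdict(list)
for prodno, dow, row_id, sales in data:
    partitions[prodno].append((dow, row_id, sales))

# Each partition is fitted independently, so this loop could run in parallel.
fits = {prodno: fit_product(rows) for prodno, rows in partitions.items()}
for prodno, (intercept, slope) in sorted(fits.items()):
    print(prodno, intercept, slope)
```

Because no partition depends on another, this is the kind of work a massively parallel platform can spread across nodes, one product per worker.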

For once technology is on our side
For the first time we have the full triumvirate of
– Excellent Computing power
– Unlimited storage
– Fast Networks
…now that RAM is cheap!

Hadoop is…
Lots of these (disks)
Not so many of these (CPUs)
Hadoop is inherently disk oriented
Typically a low ratio of CPU to disk
