Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming


By Rob Claxton, Chief Researcher in Big Data

Video at: https://www.youtube.com/watch?v=gMUSDRljGM8&index=2&list=PL5OOLwV_m9vaoNt0wM9BVjd_gWyseq0IR


  1. Hadoop for Data Science @BT
  2. Big Data is transforming the economics of data processing. In 1980, storing and querying a year's worth of Twitter data (if it had existed) would have required a spend equal to 75% of the Apollo program. In 2014 it costs approximately the same as a small used car (£6k).
  3. First operational Hadoop cluster at Tinsley Park
     • 2 x 1.2 PByte Hadoop clusters
     • Logical separation between Openreach and the rest of BT
     • Based on learning from the Research Hadoop cluster
     (Chart: Research cluster activity)
     © British Telecommunications plc
  4. Research cluster usage (chart)
  5. What are we doing with Hadoop
     Data science to understand our networks and physical assets
     • Fleet (vehicle maintenance / predictive parts ordering)
     • Buildings (energy consumption, security)
     • Networks (inventory, faults…)
     'ETL'
     • Ingest, transformation…
     • Data quality
     Storage
     • Low cost, computational file store
  6. Predictive Modelling in Hadoop
  7. From model development to production: source data → build model → evaluate model → modify model (iterate) → production implementation → execute/monitor in production
  8. In‐Hadoop model scoring (Large‐Scale Predictive Modelling with R and Apache Hive: from Modeling to Production, A. Zolotovitski, Y. Keselman)
  9. Example of building a predictive model
     Example adapted from A Handbook of Statistical Analyses using R, 2nd ed.
     • Goal ‐ predict body fat content from common anthropometric measurements
     • Reference measurement made using Dual Energy X‐ray Absorptiometry (DXA)
       – This is very accurate
       – It isn't practical for general use due to high costs and methodological effort
     • Therefore it is useful to have a simple method to predict the DXA measurement of body fat

     age  DEXfat  waistcirc  hipcirc  elbowbreadth  kneebreadth
      57   41.68      100.0    112.0           7.1          9.4
      65   43.29       99.5    116.5           6.5          8.9
      59   35.41       96.0    108.5           6.2          8.9
      58   22.79       72.0     96.5           6.1          9.2
      60   36.42       89.5    100.5           7.1         10.0
       …       …          …        …             …            …
  10. Build a predictive model in R

      library(rpart)
      library(partykit)

      # get the bodyfat data from the mboost package
      data("bodyfat", package = "mboost")

      # build the model
      bodyfat_rpart = rpart(DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth,
                            data = bodyfat, control = rpart.control(minsplit = 10))

      # plot initial regression tree
      plot(as.party(bodyfat_rpart), tp_args = list(id = FALSE))

      # save the model
      model = bodyfat_rpart
      save(model, file = './model.Rdata')

      # save the test data
      drops = c("DEXfat", "anthro3a", "anthro3b", "anthro3c", "anthro4")
      bodyfat.testdata = bodyfat[, !(names(bodyfat) %in% drops)]
      write.table(bodyfat.testdata, "./testdata.csv", sep = ",", row.names = F, col.names = F)
  11. 5‐parameter regression tree (plot)
  12. In‐Hadoop model scoring (Large‐Scale Predictive Modelling with R and Apache Hive: from Modeling to Production, A. Zolotovitski, Y. Keselman)
  13. Model execution in Hive
      • Use Hive streaming
        – TRANSFORM() allows Hive to pipe data through a user‐provided script
        – We will pipe data through 'scorer.R', the wrapper code around our regression tree model

      hive> ADD FILE ./model.Rdata;   -- make the model file and scoring script available to Hive
      hive> ADD FILE ./scorer.R;
      hive> INSERT OVERWRITE TABLE default.modelscore   -- the results of the query go into this Hive table
            SELECT TRANSFORM(
              t.age,            -- model parameters are passed to scorer.R
              t.waistcirc,
              t.hipcirc,
              t.elbowbreadth,
              t.kneebreadth
            )
            USING 'scorer.R' AS age, score   -- Hive streams the source data through this script; AS gives the schema of the results
            FROM default.modelinput t;       -- the source data comes from this table in Hive
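The deck streams data through 'scorer.R' but never shows the script itself. A minimal sketch of what such a wrapper could look like, assuming Hive delivers tab-separated rows in the TRANSFORM() column order and expects tab-separated (age, score) on stdout; everything beyond the slide's description is an assumption:

```r
#!/usr/bin/env Rscript
# Hypothetical sketch of 'scorer.R' (not shown in the deck).
# Assumes tab-separated input rows in the TRANSFORM() column order:
# age, waistcirc, hipcirc, elbowbreadth, kneebreadth
suppressMessages(library(rpart))

load("model.Rdata")  # restores the object 'model' saved by the build script

cols = c("age", "waistcirc", "hipcirc", "elbowbreadth", "kneebreadth")
con = file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  fields = as.numeric(strsplit(line, "\t", fixed = TRUE)[[1]])
  row = as.data.frame(as.list(setNames(fields, cols)))
  score = predict(model, newdata = row)
  # emit tab-separated (age, score) to match "AS age, score" in the query
  cat(row$age, score, sep = "\t")
  cat("\n")
}
close(con)
```

Note that the model is deserialized once, outside the loop, so the per-record overhead is just one predict() call.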
  14. Model execution in Hive
      • Research cluster: HP BL460 blades, 2 master nodes, 6 worker nodes
      • Half a billion records scored in under 4 minutes
      (Chart: time to score data (s) vs. number of records (millions), input sizes from 183 MByte to 8.9 GByte)
  15. Persisting Models in Hadoop
  16. Persisting Models in Hadoop
      Serialize R model, store in DB, fetch at run time via RJDBC
      • Specify model(s) to fetch with arguments to wrapper script
      • Deserialize models once ‐ minimal overhead
      • Model(s) applied to all input records
      Serialize R models, store and stream via Hive
      • Pre‐join input records with model ‐ in the limit, a different model per record
      • Stream records/models via Hive to wrapper script
      • Cache models in wrapper script
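The first option on this slide can be sketched in a few lines of base R. The table and column names below are illustrative, not BT's schema, and the DB round trip is only indicated in comments:

```r
# Sketch of "serialize R model, store in DB, fetch at run time via RJDBC".
# Uses a linear model as a stand-in for the bodyfat regression tree.
model = lm(dist ~ speed, data = cars)

# ascii = TRUE gives a text-safe serialization that fits a string DB column
txt = rawToChar(serialize(model, connection = NULL, ascii = TRUE))

# storing/fetching would go through RJDBC (connection details not shown):
# dbSendUpdate(conn, "INSERT INTO models (name, payload) VALUES (?, ?)", "bodyfat", txt)
# txt = dbGetQuery(conn, "SELECT payload FROM models WHERE name = 'bodyfat'")$payload

# the wrapper script deserializes once, then applies the model to every record
restored = unserialize(charToRaw(txt))
stopifnot(identical(coef(restored), coef(model)))
```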
  17. Persisting and streaming models via Hive (diagram)
  18. Results… ongoing
      • Difficult to schedule
      • Testing… ~10 minutes to score 10 million records with 1 cached model (regression tree, 5 parameters, approx. 12 kBytes as JSON)
  19. Extracting maximum value from Big Data
      • Plumbing – design patterns: data ingest, incremental processing, file formats, compression, new compute engines…
      • A place to get trustable insight – network/service observatory, data provenance & governance, collaborative analysis…
      • Enabling data science ‘at scale’ – in‐Hadoop modelling, in‐life A‐B trials…
  20. Bits and pieces….
      R Markdown
      • Capture analytic workflow in a reproducible way
      • Publish documents to interested community
      R packages
      • 'Hive' – simple wrapper around RJDBC to facilitate access to Hive
      • TODO: packages to support model creation and persistence
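The slide's 'Hive' package is described only as a thin wrapper around RJDBC. A hypothetical sketch of what such a helper might look like; the driver class, jar path, host and function names are all made up for illustration:

```r
# Hypothetical sketch of a thin Hive-over-RJDBC helper; not BT's actual code.
library(RJDBC)

# build a HiveServer2 JDBC URL from its parts
hive_url = function(host, port = 10000, db = "default") {
  sprintf("jdbc:hive2://%s:%d/%s", host, port, db)
}

# open a connection given a Hive JDBC driver jar (path is an assumption)
hive_connect = function(host, port = 10000, db = "default",
                        jar = "hive-jdbc-standalone.jar") {
  drv = JDBC("org.apache.hive.jdbc.HiveDriver", classPath = jar)
  dbConnect(drv, hive_url(host, port, db))
}

# usage:
# conn = hive_connect("hadoop-master.example")
# head(dbGetQuery(conn, "SELECT * FROM default.modelscore LIMIT 10"))
```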
  21. Observatory: a place where we can observe and understand the health of our systems
      • Data lifecycle: tools to support discovery, provenance, comprehension, usage
      • Analysis: capture analytical workflow – reproduce, test, validate…
      • Optimise: optimise storage of data assets
      • Models: PMML
  22. Questions?
