Hadoop meets Mature BI: Data Scientists

The key challenge for Data Scientists is not the proliferation of their roles, but the ability to ‘graduate’ key Big Data projects from the ‘Data Science Lab’ and production-ize them into their broader organizations.

Over the next 18 months, "Big Data" will become just "Data"; this means everyone (even business users) will need a way to use it - without reinventing the way they interact with their current reporting and analysis.

To do this requires interactive analysis with existing tools and massively parallel code execution, tightly integrated with Hadoop. Your Data Warehouse is dying; Hadoop will drive a material shift away from price-per-TB economics in persistent data storage.


Transcript

  1. @Kognitio @mphnyc #MPP_R #OANYC — Hadoop meets Mature BI: Where the rubber meets the road for Data Scientists. Michael Hiskey, Futurist + Product Evangelist; VP, Marketing & Business Development, Kognitio.
  2. The Data Scientist: sexiest job of the 21st Century?
  3. Key Concept: Graduation. Projects will need to graduate from the Data Science Lab and become part of Business as Usual.
  4. Demand for the Data Scientist: organizational appetite for tens, not hundreds. (© EMC Corporation and The Guardian UK; http://www.guardian.co.uk/news/datablog/2012/mar/02/data-scientist#zoomed-picture)
  5. Don't be a Railroad Stoker! Highly skilled engineering was required … but the world innovated around them.
  6. Business Intelligence: numbers, tables, charts, indicators. Time: history, lag. Access: to view (portal), to data, to depth; control/secure. Consumption: digestion … with ease and simplicity. Straddle IT and Business: faster, lower latency, more granularity, richer data model, self service.
  7. What has changed? More connected-users? More-connected users?
  8. According to one estimate, mankind created 150 exabytes (billion gigabytes) of data in 2005. In 2010 this was 1,200 exabytes.
  9. Data flow
  10. Data Variety
  11. Respondents were asked to choose up to two descriptions of how their organizations view big data from the choices above. Choices have been abbreviated, and selections have been normalized to equal 100%. n=1144. Source: IBM Institute for Business Value / Saïd Business School Survey. What? New value comes from your existing data.
  12. (© 20th Century Fox)
  13. Hadoop ticks many, but not all, of the boxes.
  14. No need to pre-process. No need to align to schema. No need to triage. Null storage concerns.
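The "no need to align to schema" point above is schema-on-read: raw records land in Hadoop as-is, and structure is applied only when a query runs. A minimal sketch of the idea in Python (the field names and records here are hypothetical, not from the deck):

```python
import json

# Raw events land as unparsed lines -- no upfront schema, no triage.
raw_lines = [
    '{"user": "a", "spend": 12.5}',
    '{"user": "b", "spend": 7.0, "region": "EMEA"}',  # extra field is fine
]

def query_total_spend(lines):
    """Apply structure only at read time (schema-on-read)."""
    total = 0.0
    for line in lines:
        record = json.loads(line)          # parse when queried, not on ingest
        total += record.get("spend", 0.0)  # missing fields are tolerated
    return total

print(query_total_spend(raw_lines))  # 19.5
```

Because parsing happens per query, records with extra or missing fields never block ingestion; the cost is paid at read time instead.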
  15. The drive for deeper understanding (chart: analytical complexity vs. technology/automation): reporting & BPM, campaign management, fraud detection, dynamic interaction, clustering, behaviour modelling, statistical analysis, dynamic simulation, machine-learning algorithms.
  16. Hadoop is just too slow for interactive BI! … loss of train of thought. "While Hadoop shines as a processing platform, it is painfully slow as a query tool."
  17. Analytics needs low latency and no I/O wait: high-speed in-memory processing.
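The low-latency argument above boils down to paying the disk-scan cost once, then serving every subsequent ad-hoc query from RAM. A hypothetical illustration in Python (the loader is a stand-in for scanning files out of HDFS):

```python
def load_from_disk():
    # Stand-in for a one-off scan of files out of HDFS.
    return list(range(1_000_000))

cache = load_from_disk()  # load the working set into memory once

def interactive_query(data, threshold):
    # Each subsequent ad-hoc query hits the in-memory copy: no I/O wait,
    # so the analyst keeps their train of thought.
    return sum(1 for x in data if x > threshold)

print(interactive_query(cache, 900_000))  # 99999
```

The design choice is the same one an in-memory analytical platform makes at scale: trade RAM footprint for query latency, so exploration stays interactive.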
  18. Analytical Platform: Reference Architecture. Application & Client Layer: all BI tools, all OLAP clients, Excel, reporting. Analytical Platform Layer, with optional near-line storage. Persistence Layer: Hadoop clusters, enterprise data warehouses, legacy systems, cloud storage.
  19. The Future: Big Data, advanced analytics, in-memory, the Logical Data Warehouse, predictive analytics, Data Scientists.
  20. Connect: www.kognitio.com | twitter.com/kognitio | linkedin.com/companies/kognitio | tinyurl.com/kognitio | youtube.com/kognitio. NA: +1 855 KOGNITIO; EMEA: +44 1344 300 770. These slides: www.slideshare.net/Kognitio
  21. Hadoop meets Mature BI: Where the rubber meets the road for Data Scientists. The key challenge for Data Scientists is not the proliferation of their roles, but the ability to 'graduate' key Big Data projects from the 'Data Science Lab' and production-ize them into their broader organizations. Over the next 18 months, "Big Data" will become just "Data"; this means everyone (even business users) will need a way to use it, without reinventing the way they interact with their current reporting and analysis. To do this requires interactive analysis with existing tools and massively parallel code execution, tightly integrated with Hadoop. Your Data Warehouse is dying; Hadoop will drive a material shift away from price-per-TB economics in persistent data storage.
  22. The new bounty hunters: Drill, Impala, Pivotal, Stinger - the NoSQL posse. WANTED, DEAD OR ALIVE: SQL.
  23. It's all about getting work done. It used to be a simple fetch of a value. Tasks are evolving: then it was calculating a dynamic aggregate; now it is complex algorithms!
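The three generations of "work" the slide describes can be sketched in a few lines of Python. This is illustrative only; the toy sales table and its values are hypothetical, and step 3 uses a plain least-squares slope to stand in for "complex algorithms" (echoing the linear fit on the deck's R slide):

```python
# A toy sales table: (date, amount) rows.
sales = [("2006-05-01", 100.0), ("2006-05-02", 110.0), ("2006-05-03", 125.0)]

# 1. Then: a simple fetch of a value.
first_day = sales[0][1]

# 2. Next: a dynamic aggregate.
total = sum(amount for _, amount in sales)

# 3. Now: a complex algorithm -- least-squares slope of sales over time.
n = len(sales)
xs = range(n)
ys = [amount for _, amount in sales]
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))

print(first_day, total, slope)  # 100.0 335.0 12.5
```

Each step demands strictly more computation over the same data, which is why a platform sized for simple fetches struggles once the workload reaches step 3.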
  24. Behind the numbers. The slide shows a Kognitio external R script and several SQL queries; a number of lines are cut off at the right edge of the original slide and are marked with "..." below:

      create external script LM_PRODUCT_FORECAST environment rsint
      receives ( SALEDATE DATE, DOW INTEGER, ROW_ID INTEGER, PRODNO INTEGER, DAILYSALES ... )
      partition by PRODNO order by PRODNO, ROW_ID
      sends ( R_OUTPUT varchar )
      isolate partitions
      script S'endofr(
        # Simple R script to run a linear fit on daily sales
        prod1 <- read.csv(file=file("stdin"), header=FALSE, row.names...
        colnames(prod1) <- c("DOW", "ID", "PRODNO", "DAILYSALES")
        dim1 <- dim(prod1)
        daily1 <- aggregate(prod1$DAILYSALES, list(DOW = prod1$DOW), ...
        daily1[,2] <- daily1[,2] / sum(daily1[,2])
        basesales <- array(0, c(dim1[1], 2))
        basesales[,1] <- prod1$ID
        basesales[,2] <- prod1$DAILYSALES / daily1[prod1$DOW + 1, 2]
        colnames(basesales) <- c("ID", "BASESALES")
        fit1 = lm(BASESALES ~ ID, as.data.frame(basesales))
      ...

      select Trans_Year, Num_Trans,
             count(distinct Account_ID) Num_Accts,
             sum(count(distinct Account_ID)) over (partition by Trans_Year ...
             cast(sum(total_spend)/1000 as int) Total_Spend,
             cast(sum(total_spend)/1000 as int) / count(distinct Account_ID ...
             rank() over (partition by Trans_Year order by count(distinct A...
             rank() over (partition by Trans_Year order by sum(total_spend) ...
      from ( select Account_ID,
                    Extract(Year from Effective_Date) Trans_Year,
                    count(Transaction_ID) Num_Trans,
                    sum(Transaction_Amount) Total_Spend, ...

      select dept, sum(sales)
      from sales_fact
      where period between date '01-05-2006' and date '31-05-2006'
      group by dept
      having sum(sales) > 50000;

      select sum(sales)
      from sales_history
      where year = 2006 and month = 5 and region = 1;

      select total_sales
      from summary
      where year = 2006 and month = 5 and region = 1;
  25. For once, technology is on our side. For the first time we have the full triumvirate: excellent computing power, unlimited storage, and fast networks … now that RAM is cheap!
  26. Hadoop is… lots of these (disks), not so many of these (CPUs). Hadoop is inherently disk-oriented, with a typically low ratio of CPU to disk.
