Hadoop meets Mature BI: Where the rubber meets the road for Data Scientists (Big Data BI, OANYC Summit)

Published in: Technology
Speaker notes
  • Language – one word changes and the whole meaning shifts. One word like Hadoop is creating seismic shifts in the world of data and its use. Core themes: History – a cycle of discover, construct, discover, construct. Human-built EDWs have had their time, but will remain as learning and a museum piece for the future. Kognitio is the new bridge between business need and the information stores of the EDW and Hadoop.
  • http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ – the “data scientist”: a high-ranking professional with the training and curiosity to make discoveries in the world of big data. The title has been around for only a few years. (It was coined in 2008 by one of us, D.J. Patil, and Jeff Hammerbacher, then the respective leads of data and analytics efforts at LinkedIn and Facebook.) But thousands of data scientists are already working at both start-ups and well-established companies. Their sudden appearance on the business scene reflects the fact that companies are now wrestling with information that comes in varieties and volumes never encountered before. If your organization stores multiple petabytes of data, if the information most critical to your business resides in forms other than rows and columns of numbers, or if answering your biggest question would involve a “mashup” of several analytical efforts, you’ve got a big data opportunity. http://www.guardian.co.uk/news/datablog/2012/mar/02/data-scientist#zoomed-picture
  • Very crudely
  • You can play the hyphen game: more connected-users, or more-connected users. Information consumers – your users, or you the user? Either way it creates more pressure on the infrastructure – more queries.
  • Mobile access is coming along, and the application space is broadening (BYOD). Mobile can supply access to BI, but it also furiously generates data for BI. It gives access to dynamic information – but every access generates data and possible inferences. Self-service access.
  • IBM Institute for Business Value/Said Business School Survey. Defining big data: much of the confusion about big data begins with the definition itself. To understand our study respondents’ definition of the term, we asked each to select up to two characteristics of big data. Rather than any single characteristic clearly dominating among the choices, respondents were divided in their views on whether big data is best described by today’s greater volume of data, the new types of data and analysis, or the emerging requirements for more real-time information analysis (see Figure 1).
  • EXPERIMENTING? So is this you, not quite sure? One CTO told me he invented a Hadoop project just to keep developers happy – it was important for their own technical development that they got the experience! That may sound daft, but think of Hadoop in 2013 like the Internet in 1996: if you are not far down that road, you are behind and need to catch up. The community is growing and building; it will address many of the limitations we see today… someday.
  • Hadoop is not a “universal solution”! Way too much hype and hyperbole – great for innovators and start-ups, not so good for plain old business.
  • The DW demanded ETL to map data into a model and ensure logical consistency – an upfront prerequisite. Hadoop is making people lazy – it cuts out that upfront thought, but leaves future decisions wide open: no lock-in, and it cuts the risk of bad early decisions. It simplifies the decision of what to keep – keep it all. BUT BI still needs structure and discipline!
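  The contrast in this note is schema-on-write (ETL up front) versus schema-on-read (keep everything raw, impose structure at query time). A minimal Python sketch of the schema-on-read side, using made-up event data and an in-memory buffer as a stand-in for raw Hadoop storage:

    ```python
    import json, io

    # Raw events are kept as-is ("keep it all") -- no upfront ETL or schema.
    raw_store = io.StringIO()
    for event in [{"user": "a", "spend": "12.50", "extra": "kept anyway"},
                  {"user": "b", "spend": "7.25"}]:
        raw_store.write(json.dumps(event) + "\n")

    # Schema-on-read: structure is imposed only when a question is asked,
    # so an early modelling decision does not lock anything in.
    def query_total_spend(store):
        store.seek(0)
        return sum(float(json.loads(line)["spend"]) for line in store)

    total = query_total_spend(raw_store)
    print(total)  # 19.75
    ```

  The trade-off the note warns about is visible here: nothing stops a malformed event from landing in the store, so the discipline that ETL used to enforce has to be applied at read time instead.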
  • Bottlenecks are caused by platforms and tools unable to cope with the demands of complexity, disparity and volume. Complex analytics: machine learning (fraud detection/gaming); web analytics (dynamic content/bid management); modelling (traditional clustering, behavioural modelling for marketing, product development, resource optimisation); investigative reporting (dashboards and reports with granular data access); the data model.
  • Hadoop – the false dawn. There, I have dared to say it! It does not accelerate BI in quite the same way as the EDW. Users have had a decade of being sold train-of-thought analysis – iCubes and Visual Insight. Hadoop is not hands-on, not desktop, not agile.
  • Lots of access to data – iterations. Analytics is about work done – and more work needs to be done, so don’t hold CPUs back! In-memory is not cache! Memory is underplayed in Hadoop – it’s cheap, use it! Processors and RAM are the true measure of the work that can be done – disks just fetch. Keep data in memory! Don’t swap, don’t wait on disk, don’t pick through indexes then data; just access what is needed. The economics of RAM have changed: much lower cost, large volumes readily available.
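  The point about iterative access is that a disk-centric platform pays the I/O cost on every pass, while an in-memory platform pays it once. A small Python sketch with hypothetical sales rows (file names and figures are invented for illustration):

    ```python
    import tempfile, os

    # Hypothetical data: one year of (day, day-of-week, sales) rows.
    rows = [(day, day % 7, 100.0 + day) for day in range(365)]

    # Disk-centric style: every analytic pass re-reads and re-parses the file.
    path = os.path.join(tempfile.mkdtemp(), "sales.csv")
    with open(path, "w") as f:
        for day, dow, sales in rows:
            f.write(f"{day},{dow},{sales}\n")

    def disk_pass(dow_wanted):
        with open(path) as f:
            return sum(float(s) for line in f
                       for d, w, s in [line.strip().split(",")]
                       if int(w) == dow_wanted)

    # In-memory style: load once, then every iteration is a pure RAM scan --
    # no I/O wait, no index walk, just access what is needed.
    in_ram = rows
    def ram_pass(dow_wanted):
        return sum(sales for _, dow, sales in in_ram if dow == dow_wanted)

    assert disk_pass(3) == ram_pass(3)  # same answer, very different cost model
    ```

  At this toy scale both are instant; the slide's argument is that at Hadoop scale the repeated parse-and-fetch of the disk path dominates iterative analytics.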
  • Ah yes, plugging into Hadoop. So much for the NoSQL revolution – universal integration is needed to protect the BI investment. NoSQL lost the gun fight; like all revolutions, the upstarts died down and got absorbed (subsumed). Business and BI investment demand SQL! First Hive; now we have Drill, Impala, Pivotal. It’s a tough game – yes, it’s SQL access, but not low latency.
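  The reason SQL won is that the existing BI layer already speaks it. As a sketch of the kind of query that has to keep working whatever engine sits underneath, here is Python's built-in sqlite3 standing in for a SQL-on-Hadoop engine (Hive, Impala, Drill); the table and figures are invented:

    ```python
    import sqlite3

    # sqlite is only a stand-in: the point is the BI tool issues plain SQL
    # and does not care whether the engine underneath is a DW or Hadoop.
    conn = sqlite3.connect(":memory:")
    conn.execute("create table sales_fact (dept text, period text, sales real)")
    conn.executemany("insert into sales_fact values (?, ?, ?)",
                     [("toys", "2006-05-01", 30000.0),
                      ("toys", "2006-05-02", 25000.0),
                      ("food", "2006-05-01", 10000.0)])

    # A departmental summary of the shape the deck shows against the warehouse.
    rows = conn.execute("""
        select dept, sum(sales)
        from sales_fact
        where period between '2006-05-01' and '2006-05-31'
        group by dept
        having sum(sales) > 50000
    """).fetchall()
    print(rows)  # [('toys', 55000.0)]
    ```

  The note's caveat stands: the hard part for the SQL-on-Hadoop engines is not accepting this syntax but returning the answer at interactive latency.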
  • What the business cares about is getting work done. They really don’t care about how or where the data is stored! It’s not about raw individual speed, it’s about throughput. Address the bottlenecks – too many vendors play games that just shift the bottleneck.
  • BI mostly focuses (sells) on presentation – graphics, pictures, visualisation. BUT behind the scenes a lot of heavy lifting has to be done, and this workload has changed over time from the simple to the complex.
  • No need for single platforms like the traditional DW that both store and analyse. This is why data science rises – we did not get this in the rise of data mining in the ’90s. We’ll come onto RAM shortly.
  • BATCH: Hadoop is disk-centric – storage, just like the EDW. More parallelism, yes, lots more, but still batch, disk-I/O centric. Its schedulers were not designed for rapid response – it is essentially a batch queue, while BI applications and business users have significantly evolved beyond batch reporting.
  • Transcript

    • 1. @Kognitio @mphnyc #MPP_R #OANYC – Hadoop meets Mature BI: Where the rubber meets the road for Data Scientists. Michael Hiskey, Futurist + Product Evangelist, VP, Marketing & Business Development, Kognitio
    • 2. The Data Scientist: sexiest job of the 21st century?
    • 3. Key Concept: Graduation. Projects will need to graduate from the Data Science Lab and become part of Business as Usual.
    • 4. Demand for the Data Scientist: organizational appetite for tens, not hundreds.
    • 5. Don’t be a Railroad Stoker! Highly skilled engineering required… but the world innovated around them.
    • 6. Business Intelligence: numbers, tables, charts, indicators. Time – history, lag. Access – to view (portal), to data, to depth, control/secure. Consumption – digestion… with ease and simplicity. Straddle IT and business: faster, lower latency, more granularity, richer data model, self service.
    • 7. What has changed? More connected-users? More-connected users?
    • 8. According to one estimate, mankind created 150 exabytes (billion gigabytes) of data in 2005. In 2010 this was 1,200 exabytes.
    • 9. Data flow
    • 10. Data Variety
    • 11. Respondents were asked to choose up to two descriptions of how their organizations view big data from the choices above. Choices have been abbreviated, and selections have been normalized to equal 100%. n=1144. Source: IBM Institute for Business Value/Said Business School Survey. What? New value comes from your existing data.
    • 12. © 20th Century Fox
    • 13. Hadoop ticks many but not all the boxes.
    • 14. No need to pre-process. No need to align to schema. No need to triage. Null storage concerns.
    • 15. The drive for deeper understanding (chart plotting technology/automation against analytical complexity): reporting & BPM, campaign management, fraud detection, dynamic interaction, clustering, behaviour modelling, statistical analysis, dynamic simulation, machine learning algorithms.
    • 16. Hadoop just too slow for interactive BI! …loss of train-of-thought. “While Hadoop shines as a processing platform, it is painfully slow as…” [quote truncated]
    • 17. Analytics needs low latency, no I/O wait – high speed in-memory processing.
    • 18. Analytical Platform: Reference Architecture. Application & Client Layer: all BI tools, all OLAP clients, Excel, reporting. Analytical Platform Layer, with optional near-line storage. Persistence Layer: Hadoop clusters, enterprise data warehouses, legacy systems, cloud storage.
    • 19. The Future: Big Data, Advanced Analytics, In-memory, Logical Data Warehouse, Predictive Analytics, Data Scientists.
    • 20. Connect: www.kognitio.com | twitter.com/kognitio | linkedin.com/companies/kognitio | tinyurl.com/kognitio | youtube.com/kognitio. NA: +1 855 KOGNITIO; EMEA: +44 1344 300 770
    • 21. Hadoop meets Mature BI: Where the rubber meets the road for Data Scientists. The key challenge for Data Scientists is not the proliferation of their roles, but the ability to ‘graduate’ key Big Data projects from the ‘Data Science Lab’ and production-ize them into their broader organizations. Over the next 18 months, “Big Data” will become just “Data”; this means everyone (even business users) will need a way to use it – without reinventing the way they interact with their current reporting and analysis. To do this requires interactive analysis with existing tools and massively parallel code execution, tightly integrated with Hadoop. Your Data Warehouse is dying; Hadoop will elicit a material shift away from price per TB in persistent data storage.
    • 22. The new bounty hunters: Drill, Impala, Pivotal, Stinger – the NoSQL posse. Wanted, dead or alive: SQL.
    • 23. It’s all about getting work done. It used to be a simple fetch of a value; tasks are evolving: then it was calculating a dynamic aggregate, now it’s complex algorithms!
    • 24. Behind the numbers. The slide shows the work behind BI queries, from complex to simple (code as captured from the slide, truncated at the right edge):
      create external script LM_PRODUCT_FORECAST environment rsint
      receives ( SALEDATE DATE, DOW INTEGER, ROW_ID INTEGER, PRODNO INTEGER, DAILYSALES …
      partition by PRODNO order by PRODNO, ROW_ID
      sends ( R_OUTPUT varchar )
      isolate partitions
      script Sendofr( # Simple R script to run a linear fit on daily sales
      prod1<-read.csv(file=file("stdin"), header=FALSE, row.names…
      colnames(prod1)<-c("DOW","ID","PRODNO","DAILYSALES")
      dim1<-dim(prod1)
      daily1<-aggregate(prod1$DAILYSALES, list(DOW = prod1$DOW), …
      daily1[,2]<-daily1[,2]/sum(daily1[,2])
      basesales<-array(0,c(dim1[1],2))
      basesales[,1]<-prod1$ID
      basesales[,2]<-(prod1$DAILYSALES/daily1[prod1$DOW+1,2])
      colnames(basesales)<-c("ID","BASESALES")
      fit1=lm(BASESALES ~ ID,as.data.frame(basesales))

      select Trans_Year, Num_Trans, count(distinct Account_ID) Num_Accts,
      sum(count(distinct Account_ID)) over (partition by Trans_Year …
      cast(sum(total_spend)/1000 as int) Total_Spend,
      cast(sum(total_spend)/1000 as int) / count(distinct Account_ID …
      rank() over (partition by Trans_Year order by count(distinct A…
      rank() over (partition by Trans_Year order by sum(total_spend) …
      from ( select Account_ID, Extract(Year from Effective_Date) Trans_Year,
      count(Transaction_ID) Num_Trans, …

      select dept, sum(sales) from sales_fact
      where period between date ‘01-05-2006’ and date ‘31-05-2006’
      group by dept having sum(sales) > 50000;

      select sum(sales) from sales_history
      where year = 2006 and month = 5 and region=1;

      select total_sales from summary
      where year = 2006 and month = 5 and region=1;
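    The R fragment on slide 24 deseasonalizes daily sales by a day-of-week factor, then fits a linear trend to the resulting "base sales" (lm(BASESALES ~ ID)). A self-contained Python sketch of the same idea, with synthetic data and a hand-rolled least-squares fit – no claim to match Kognitio's external-script syntax:

      ```python
      # Synthetic daily sales: a linear trend plus a day-of-week effect
      # (weekends sell more). All figures are invented for illustration.
      sales = [(day, day % 7,
                100.0 + 2.0 * day + [0, 5, 5, 5, 5, 20, 25][day % 7])
               for day in range(28)]

      # Day-of-week factors, as in the R script's aggregate/normalize step.
      totals = [0.0] * 7
      for _, dow, s in sales:
          totals[dow] += s
      factors = [t / sum(totals) for t in totals]

      # Deseasonalized "base sales", then an ordinary least-squares fit of
      # base sales against the day index (what lm(BASESALES ~ ID) does).
      xs = [day for day, _, _ in sales]
      ys = [s / factors[dow] for _, dow, s in sales]
      n = len(xs)
      mx, my = sum(xs) / n, sum(ys) / n
      slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
               / sum((x - mx) ** 2 for x in xs))
      intercept = my - slope * mx
      print(slope > 0)  # the fitted trend slopes upward
      ```

    The point of the slide survives the translation: the "simple fetch of a value" queries below it are trivial, while this kind of model fit is the heavy lifting that now sits behind the numbers.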
    • 25. For once, technology is on our side. For the first time we have the full triumvirate: excellent computing power, unlimited storage, fast networks… now that RAM is cheap!
    • 26. Hadoop is… lots of disks, not so many CPUs. Hadoop is inherently disk oriented, with a typically low ratio of CPU to disk.
