
Hadoop and Beyond

A talk given in Austin, Texas on 2013-05-28 about how cognitive bias interferes with leveraging distributed systems for large-scale apps. Also, about design patterns for Enterprise data workflows. http://hadoop-and-beyond-austin.eventbrite.com/

Published in: Technology

Transcript of "Hadoop and Beyond"

  1. “Hadoop and Beyond”, Paco Nathan, @pacoid, bit.ly/pxnnews
  2. Contents: 1. Conceptual Map; 2. Design Patterns
      [Conceptual map topics: First Principles, Topologies, Languages, Modeling, Attention, Clusters, Algorithms, Trendlines, Organization, Architecture, Culture, Business, Personality, Learning Curves]
      [Workflow diagram labels: Failure Traps, bonus allocation, employee, PMML classifier, quarterly sales, Join, Count, leads]
  3.–4. [Conceptual map: “Beyond Hadoop”; section highlighted: First Principles]
  5.–8. First Principles
      we are taught to think of computing resources in terms of Von Neumann architecture
      in other words, we characterize the computing resources by CPU, RAM, I/O
      [builds highlight CPU, RAM, and I/O in turn]
  9.–12. First Principles
      back in the day, all the tables required for a given database could fit onto one computer, with one memory space, and one file space
      • okay, maybe the CPU was multi-core…
      • okay, maybe RAM paged out to virtual memory…
      • okay, maybe the disks were in a RAID config…
      or there were extra caches, or separate busses, etc.
      but essentially those were incremental extensions to a Von Neumann architecture… a machine created in his image, if you will
      NB: credit should go to Eckert and Mauchly, inventors of the ENIAC
  13.–14. First Principles
      a generation of computer scientists has been taught to think “relational” – data on a DB server
      RDBMS made sense, with their indexes, b-trees, normal forms, etc.
      Q: need to query bigger data?
      A: simple, buy or lease a bigger DB server
      however, that all changed…
      some of the issues encountered in large-scale data teams are, to put it politely, obscure
      starting from first principles, let’s explore a map of some important points to consider
  15.–16. [Conceptual map: “Beyond Hadoop”; section highlighted: Topologies]
  17. Topologies
      largely due to the rapid rise of machine data, circa late 1990s, we use distributed systems because the data won’t fit on one computer anymore
      AMZN, EBAY, YHOO, GOOG leveraged horizontal scale-out, based on commodity hardware
      practices at LinkedIn, Apple, Facebook, Twitter, etc., followed from those early successes
      algorithmic modeling, applied to the aggregation of machine data, allowed for Big Data to become monetized
      a feedback loop evolved – refining aggregate social interactions into data products, which in turn made web apps become more intelligent
  18.–19. [Diagram] Circa 2001: post- big e-commerce successes
      labels: Web Apps, Middleware, servlets, models, customer transactions, RDBMS, SQL Query, result sets, Logs, event history, DW, ETL, aggregation, dashboards, Algorithmic Modeling, recommenders + classifiers
      roles: Stakeholder, Product, Engineering, UX, Customers
      callout on slide 19: “data products”
  20. Topologies
      Hadoop and other topologies arose from a need for fault-tolerant workloads, leveraging horizontal scale-out based on commodity hardware
      because the data won’t fit on one computer anymore
      a variety of Big Data technologies has since emerged, which can be categorized in terms of topologies and the CAP Theorem
  21. Hadoop, as a topology [figure labels: Apache, Wikipedia]
      components which implement MapReduce:
      • name node / data node
      • job tracker / task tracker
      • submit queue
      • task slots
      • distributed cache
      • HDFS
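The MapReduce model that the components on slide 21 implement can be sketched in a few lines. This is a toy, single-process illustration and not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and there is no name node, job tracker, or HDFS involved.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit one (key, value) pair per token in the input record.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would do
    # between the map tasks and the reduce tasks.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: collapse the values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

On a real cluster the map and reduce calls would run as separate tasks on different machines; here they run in sequence only to show the data flow.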
  22. Some Other Topologies…
      Spark (iterative/interactive)
      Titan (graph database)
      Redis (in-memory data grid)
      Zookeeper (distributed metadata)
      HBase (columnar data objects)
      Cassandra (key-value store)
      Storm (real-time streams)
      ElasticSearch (search index)
      MongoDB (document store)
      Greenplum (MPP)
      SciDB (array database)
  23. CAP Theorem
      “You can have at most two of these properties for any shared-data system… the choice of which feature to discard determines the nature of your system.” – Eric Brewer, 2000 (Inktomi/YHOO)
      [Diagram: C = strong consistency, A = high availability, P = partition tolerance; eventual consistency]
      cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
      julianbrowne.com/article/viewer/brewers-cap-theorem
  24.–25. Access → Frameworks → CAP Theorem Forfeits
      Access                              Frameworks                 C  A  P
      financial transactions              general ledger in RDBMS    C  A  x
      ad-hoc queries                      RDS (hosted MySQL)         C  A  x
      reporting, dashboards               like Pentaho               C  A  x
      log rotation/persistence            like Riak, Cassandra       x  x  P
      search indexes                      like ElasticSearch, Solr   x  A  P
      static content, archives            S3 (durable storage)       x  A  P
      key/value data objects              like HBase                 C  x  P
      data prep, ETL, modeling at scale   like Hadoop/Cascading      C  x  P
      graph queries                       like Titan                 C  x  P
  26.–27. [Diagram] Circa 2013: clusters everywhere (Use Cases Across Topologies)
      labels: Web Apps, Mobile, etc.; services; transactions, content; social interactions; RDBMS; Log Events; History; In-Memory Data Grid; Hadoop, etc.; Cluster Scheduler; DW; Workflow; near time; batch; Data Products; Planner; dashboard metrics; business process; optimized capacity; taps; s/w dev; data science; discovery + modeling; introduced capability; existing SDLC
      roles: Customers, Prod, Eng, Ops, Data Scientist, App Dev, Domain Expert
      callout on slide 27: “optimize topologies”
  28. In their own words…
      Amazon: “Early Amazon: Splitting the website” – Greg Linden
        glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
      eBay: “The eBay Architecture” – Randy Shoup, Dan Pritchett
        addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
        addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
      Inktomi (YHOO Search): “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
        youtu.be/E91oEn1bnXM
      Google: “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
        youtu.be/qsan-GQaeyk
        perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
      MIT Media Lab: “Social Information Filtering for Music Recommendation” – Pattie Maes
        pubs.media.mit.edu/pubs/papers/32paper.ps
        ted.com/speakers/pattie_maes.html
  29.–30. [Conceptual map: “Beyond Hadoop”; section highlighted: Modeling]
  31. Modeling
      back in the day, we worked with practices based on data modeling
      1. sample the data
      2. fit the sample to a known distribution
      3. ignore the rest of the data
      4. infer, based on that fitted distribution
      that served well with ONE computer, ONE analyst, ONE model… just throw away annoying “extra” data
      circa late 1990s: machine data, aggregation, clusters, etc.
      algorithmic modeling displaced data modeling because the data won’t fit on one computer anymore
  32. Two Cultures
      “A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.”
      Statistical Modeling: The Two Cultures, Leo Breiman, 2001, bit.ly/eUTh9L
      in other words, seeing the forest for the trees…
      this paper chronicled a sea change from data modeling practices (silos, manual process) to the rising use of algorithmic modeling (machine data for automation/optimization)
  33. Algorithmic Modeling
      “The trick to being a scientist is to be open to using a wide variety of tools.” – Breiman
      circa 2001: Random Forest, bootstrap aggregation, etc., yield dramatic increases in predictive power over earlier modeling such as Logistic Regression
      major learnings from the Netflix Prize: the power of ensembles, model chaining, etc.
      the problems at hand have become simply too big and too complex for ONE distribution, ONE model, ONE team…
      stanford.edu/~lmackey/papers/netflix_story-nas11-slides.pdf
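Bootstrap aggregation ("bagging"), mentioned on slide 33, can be sketched with nothing but the standard library. This is a minimal illustration under stated assumptions, not Breiman's Random Forest: the base learner is a hypothetical one-feature threshold classifier (`fit_stump`), the data are (x, label) pairs with labels 0 or 1, and all names are made up for the example.

```python
import random
from collections import Counter


def fit_stump(sample):
    # Base learner: choose the threshold (taken from the sample's own x
    # values) that minimizes errors for the rule "predict 1 when x > threshold".
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x > threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagged_stumps(data, n_models=25, seed=0):
    # Bagging: fit each model on a bootstrap sample, i.e. a sample of the
    # same size as the data, drawn with replacement.
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]
        models.append(fit_stump(sample))
    return models


def predict(models, x):
    # Aggregate: majority vote across the ensemble.
    votes = Counter(int(x > threshold) for threshold in models)
    return votes.most_common(1)[0][0]
```

Each stump sees a slightly different resampling of the data, so the vote smooths out the variance of any single fit; that is the ensemble effect the slide attributes to this family of methods.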
  34.–35. [Conceptual map: “Beyond Hadoop”; section highlighted: Attention]
  36. Attention
      impromptu survey:
      • how many people say they practice some kind of “Agile” process at work?
      • how many people say that they DON’T practice “Agile”?
      • how many people say they are in a lean startup?
      Q: with respect to Big Data practices, how is that working out?
      Abby Fichtner, vimeo.com/27797408
  37. Agile Data?
      some people see a reconciliation of Agile process and Big Data…
      Agile Data, Russell Jurney, 2013, amazon.com/dp/1449326269
      “Run like a studio, not an assembly line.”
  38. Perhaps Not
      great values, wrong domain… that worked when we were building features in web apps
      Agile represents industrialization of software engineering, codifying social interactions, compartmentalizing attention
      meanwhile, Data Science is inherently multi-disciplinary:
      • teams of people with complementary skill sets
      • actionable insights require weeks/months, not hours
      • variance and statistical thinking are foreign to CS
      LinkedIn-style problems circa 2011 required certain skills… manipulating the Newtonian physics of data… that money may be mostly off the table by now
      Big Data opportunities ahead require different math?
  39.–40. [Conceptual map: “Beyond Hadoop”; section highlighted: Business]
  41. Business Disruption
      Geoffrey Moore (Mohr Davidow Ventures, author of Crossing The Chasm) / Hadoop Summit, 2012:
        what Amazon did to the retail sector… has put the entire Global 1000 on notice over the next decade… data as the major force… mostly through apps – verticals, leveraging domain expertise
      Michael Stonebraker (INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc.) / XLDB, 2012:
        complex analytics workloads are now displacing SQL as the basis for Enterprise apps
      Larry Page (CEO, Google) / Wired, 2013:
        create products and services that are 10 times better than the competition… thousand-percent improvement requires rethinking problems entirely, exploring the edges of what’s technically possible, and having a lot more fun in the process
  42. Business Drivers
      algorithmic modeling + machine data + curation, metadata + Open Data
        data products, as feedback into automation
        evolution of feedback loops
        less about “bigness”, more about complexity
      internet of things + A/D conversion + complex analytics
        accelerated evolution, additional feedback loops
        orders of magnitude higher data rates
      Internet of Things accelerates this process of disruption
      “A kind of Cambrian explosion” (source: National Geographic)
  43. Internet of Things
  44. A Thought Exercise
      consider that when a company like Caterpillar moves into data science, they won’t be building the world’s next search engine or social network
      they will most likely be optimizing supply chain, optimizing fuel costs, automating data feedback loops integrated into their equipment…
      that’s a $50B company, in a market segment worth $250B
      upcoming: tractors as drones – guided by complex, distributed data apps
      Operations Research – crunching amazing amounts of data
  45. Alternatively… climate.com
  46.–47. [Conceptual map: “Beyond Hadoop”; section highlighted: Algorithms]
  48. Algorithms
      many algorithm libraries used today are based on implementations back when people used DO loops in FORTRAN, 30+ years ago
      MapReduce is Good Enough? Jimmy Lin, UMD, umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
      astrophysics and genomics are light years ahead in sophisticated algorithms work – as Breiman suggested in 2001 – which may take a while to percolate into industry
      other game-changers:
      • streaming algorithms, sketches, probabilistic data structures
      • significant “Big O” complexity reduction (e.g., skytree.net)
      • better architectures and topologies (e.g., GPUs and CUDA)
      • partial aggregates – parallelizing workflows
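The last bullet on slide 48, partial aggregates, can be illustrated with a mean computed over partitions. This is a minimal sketch with hypothetical function names: each partition is reduced independently to a small (count, sum) pair, and the pairs are then merged. Because `merge` is associative and commutative, the merging can happen in any order and on any machine, which is what lets a workflow be parallelized.

```python
from functools import reduce


def partial(values):
    # Per-partition work: summarize the partition as (count, sum).
    return (len(values), sum(values))


def merge(a, b):
    # Combine two partial aggregates; order and grouping do not matter.
    return (a[0] + b[0], a[1] + b[1])


def mean_from_partitions(partitions):
    # Assumes at least one value overall; an empty input would divide by zero.
    count, total = reduce(merge, map(partial, partitions), (0, 0.0))
    return total / count
```

Note that averaging the per-partition means directly would give the wrong answer when partitions differ in size; carrying the count alongside the sum is what makes the aggregate mergeable.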
  49.–50. How much does it cost you to earn $1B?
      also, take a moment to check this out… (IMHO most interesting algorithm work recently)
      QR factorization of a “tall-and-skinny” matrix
      • used to solve many data problems at scale, e.g., PCA, SVD, etc.
      • numerically stable with efficient implementation on large-scale Hadoop clusters
      suppose that you have a sparse matrix of customer interactions where there are 100MM customers, with a limited set of outcomes…
      cs.purdue.edu/homes/dgleich
      stanford.edu/~arbenson
      github.com/ccsevers/scalding-linalg
      David Gleich, slideshare.net/dgleich
      Tristan Jehan
      distributed algorithms for high ROI use cases on cost-effective clustered resources… we’re learning how to do it right
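The idea behind QR factorization of a tall-and-skinny matrix can be sketched as follows: split the rows into blocks, factor each block independently, stack the small R factors, and factor the stack once more. This is a toy, pure-Python illustration and not the implementation from the linked work: it uses modified Gram-Schmidt for the local factorizations (a simplification, since production codes typically use Householder-based QR), it returns only R, and it assumes every block has at least as many rows as columns and full column rank. The names `qr_r` and `tsqr_r` are hypothetical.

```python
import math


def qr_r(rows):
    # Return the n x n upper-triangular R of a QR factorization,
    # computed with modified Gram-Schmidt over the columns.
    n = len(rows[0])
    cols = [[row[j] for row in rows] for j in range(n)]
    q = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for i in range(j):
            # Project out the component along each earlier basis vector.
            r[i][j] = sum(a * b for a, b in zip(q[i], v))
            v = [a - r[i][j] * b for a, b in zip(v, q[i])]
        r[j][j] = math.sqrt(sum(a * a for a in v))
        q.append([a / r[j][j] for a in v])
    return r


def tsqr_r(rows, block_size):
    # "Map": factor each row block independently (could run on separate nodes).
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    local_rs = [qr_r(block) for block in blocks]
    # "Reduce": stack the small R factors and factor the stack once more.
    stacked = [row for r in local_rs for row in r]
    return qr_r(stacked)
```

Only the small n x n factors move between stages, so the communication cost depends on the number of columns rather than the number of rows; a useful sanity check is that the result R satisfies RᵀR = AᵀA up to rounding.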
  51.–52. [Conceptual map: “Beyond Hadoop”; section highlighted: Personality]
  53. Personality
      we have perhaps built computers (once named “electronic brains”) in the image of John Von Neumann, et al.: standalone genius, aristotelian uber-geek, incredible capacity for memory and logic, overbearing, not particularly cooperative…
      one can almost imagine a war-time dialogue, “Get one of these guys in the room, they’ll solve anything!” … as a result, decades of mutually assured destruction for global strategy
      Q: have we created software engineering practices which selected for this kind of personality? selecting for “lone wolf” guys, socially awkward, ONE person who can understand an entire code base, able to out-logic and out-argue the rest of the room… charming fellow, really
      have we enabled software process to box these personalities into something resembling teams? along with overtly described rules for social conventions… silos, in other words
  54. Chasing Unicorns
      silos… but didn’t that all change? because the data won’t fit on one computer anymore
      leverage with data science teams is where organizations tear down internal silos, socializing hard problems
      data won’t fit on one computer anymore, problems won’t fit in one department anymore, the code base won’t fit into one uber-geek’s memory recall anymore… so we embrace distributed systems for solutions
      Q: “Why aren’t there more women in engineering?”
      IMHO, we’re trying to select for a personality which doesn’t exist, and would not resolve current challenges; meanwhile, my data science teams run about 50/50
  57. Clusters
      a little secret: people like me make a good living by leveraging high ROI apps based on clusters, and so the execs agree to build out more data centers…
      clusters for Hadoop/Hive/HBase, clusters for Memcached, for Cassandra, for MySQL, for Storm, for Nginx, etc.
      this becomes expensive!
      a single class of workloads on a given cluster is simpler to manage; but terrible for utilization
      leveraging VMs and various notions of “cloud” helps
      Cloudera, Hortonworks, probably EMC soon: sell a notion of “Hadoop as OS”: All your workloads are belong to us
      regardless of how architectures change, death and taxes will endure: servers fail, and data must move
      [photo: Google Data Center, Fox News, ~2002]
  58. Operating Systems, redux
      meanwhile, GOOG is 3+ generations ahead, with much improved ROI on data centers
      John Wilkes, et al., Borg/Omega: “10x” secret sauce
      youtu.be/0ZFMlO98Jkc
      [charts: Rails CPU load, Memcached CPU load, and Hadoop CPU load over time t (0% to 100%), followed by the combined CPU load (Rails, Memcached, Hadoop)]
      Florian Leibert, Chronos/Mesos @ Airbnb
      Mesos, open source cloud OS – like Borg
      incubator.apache.org/mesos
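The utilization argument behind the combined-load chart comes down to simple arithmetic: dedicated clusters must each be sized for their own peak, while a shared cluster only has to be sized for the peak of the summed load. A rough sketch follows, with a hypothetical class name (`ClusterUtilization`) and made-up load numbers in the usage note; nothing here is taken from Borg, Omega, or Mesos.

```java
/** Sketch: capacity needed for dedicated clusters versus one shared cluster. */
final class ClusterUtilization {

    /** One cluster per workload: total capacity is the sum of the per-workload peaks. */
    static double dedicatedCapacity(double[][] loads) {
        double total = 0.0;
        for (double[] series : loads) {
            double peak = 0.0;
            for (double v : series) peak = Math.max(peak, v);
            total += peak;
        }
        return total;
    }

    /** One shared cluster: capacity is the peak of the summed load at any time step. */
    static double sharedCapacity(double[][] loads) {
        double peak = 0.0;
        for (int t = 0; t < loads[0].length; t++) {
            double sum = 0.0;
            for (double[] series : loads) sum += series[t];
            peak = Math.max(peak, sum);
        }
        return peak;
    }

    /** Average total load across all workloads, per time step. */
    static double meanTotalLoad(double[][] loads) {
        double sum = 0.0;
        for (double[] series : loads) {
            for (double v : series) sum += v;
        }
        return sum / loads[0].length;
    }
}
```

With invented loads (in machines) of {80, 20, 10, 70} for Rails, {70, 30, 10, 60} for Memcached, and {10, 90, 100, 20} for Hadoop, the dedicated clusters need 250 machines while a shared cluster needs 160, for the same mean load of 142.5. That is roughly 57% utilization versus 89%, because the workloads peak at different times.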
  61. Trendlines
      Big Data? we’re just getting started:
      • ~12 exabytes/day, jet turbines on commercial flights
      • Google self-driving cars, ~1 Gb/s per vehicle
      • National Instruments initiative: Big Analog Data™
      • 1m resolution satellites skyboximaging.com
      • open resource monitoring reddmetrics.com
      • Sensing XChallenge nokiasensingxchallenge.org
      consider the implications of Jawbone, Nike, etc., plus the secondary/tertiary effects of Google Glass
      7+ billion people, instrumented better than … how we have Nagios instrumenting our web servers right now
      plus the business implications given that much of the Global 1000 is positioned to be disrupted
      technologyreview.com/...
  62. Three Laws, or more?
      meanwhile, architectures evolve toward much, much larger data…
      pistoncloud.com/ ...
      Rich Freitas, IBM Research
      Q: what kinds of evolution in topologies could this imply?
  65. Languages
      JVM-based languages became popular for Big Data open source technologies:
      • partly because YHOO adopted Hadoop, etc.
      • partly because Enterprise IT shops have J2EE expertise
      • partly because of functional languages: Clojure, Scala
      JVM has its drawbacks, especially for low-latency use cases
      ample use of languages such as Python and Erlang in Big Data practices, plus keep in mind that Google uses C++
      Functional Thinking, Neal Ford
      youtu.be/plSZIkLodDM
      a hunch: issues about current programming languages are secondary to culture
  66. Functional Programming for Big Data
      WordCount with token scrubbing…
      Apache Hive: 52 lines HQL + 8 lines Python (UDF)
      compared to
      Scalding: 18 lines Scala/Cascading
      functional programming languages help reduce software engineering costs at scale, over time
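For a feel of why the functional version is so much shorter, here is WordCount with token scrubbing written as one pipeline. This is neither the Hive nor the Scalding code the slide counts; it is a plain Java streams sketch that runs in memory, and the class name (`WordCountSketch`), the delimiter set, and the scrubbing rule (trim plus lower-case) are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Sketch: WordCount with token scrubbing as a single functional pipeline. */
final class WordCountSketch {

    /** Token scrubbing: an assumed, minimal cleanup rule (trim and lower-case). */
    static String scrub(String token) {
        return token.trim().toLowerCase(Locale.ROOT);
    }

    /** Split lines into tokens, scrub them, drop empties, then group and count. */
    static Map<String, Long> wordCount(Stream<String> lines) {
        return lines
            .flatMap(line -> Arrays.stream(line.split("[ \\[\\]\\(\\),.]")))
            .map(WordCountSketch::scrub)
            .filter(token -> !token.isEmpty())
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }
}
```

Each stage is a small pure function, so the scrubbing rule can be tested or replaced without touching the counting logic, which is the kind of cost reduction over time that the slide is pointing at.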
  67. references…
      “Scalable and Flexible Machine Learning With Scala @ LinkedIn”, Vitaly Gordon [ especially see slide #9 ]
      slideshare.net/VitalyGordon/scalable-and-flexible-machine-learning-with-scala-linkedin
      Elements Of Functional Programming, Chris Reade
      amazon.com/dp/0201129159
  70. Organization
      How Do Committees Invent? Melvin Conway, 1968
      melconway.com/research/committees.html
      “Any organization that designs a system (defined more broadly here than just information systems) will inevitably produce a design whose structure is a copy of the organization’s communication structure.”
      Q:
      • does this fit with software process?
      • does this fit with distributed apps?
      see also: haacked.com/archive/2013/05/13/applying-conways-law.aspx
      [cartoon: Manu Cornet, bonkersworld.net]
  71. Cooperation
      perhaps we have selected for the wrong personality to idealize…
      All long-term success depends on eliciting the voluntary support of an ecosystem. As the African proverb says, “If you want to go fast, go alone; if you want to go far, go with others.” – Geoffrey Moore
      linkedin.com/today/post/article/20130520190305-110300724-why-nothing-not-even-software-can-eat-the-world
  72. Team Composition: Needs × Roles
      [diagram: needs plotted against roles]
      needs: discovery, modeling, integration, apps, systems
      roles: Domain Expert, Data Scientist, App Dev, Ops
      labels: business process, stakeholder; data prep, discovery, modeling, etc.; software engineering, automation; systems engineering, access; data science; introduced
  75. Architecture
      Rich Hickey, Nathan Marz, Stuart Sierra, et al.: functional programming to help reduce costs over time
      1. technical debt? this is how an organization builds a culture to avoid it
      2. Conway’s Law corollary: model teams and communication based on properties of the desired architecture
      3. also consider Mesos/Borg: schedule data to be located where [CPU, RAM, I/O, surety] will become available
      Rich Hickey, infoq.com/presentations/Simple-Made-Easy
  76. Lambda Architecture
      Big Data, Nathan Marz, James Warren
      manning.com/marz
      • batch layer (immutable data, idempotent ops)
      • serving layer (to query batch)
      • speed layer (transient, cached “real-time”)
      • combining results
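The four bullets above can be sketched as a toy page-view counter. This is only an in-memory illustration of the layering, not code from the Marz and Warren book: the class name (`LambdaSketch`), the method names, and the choice of counting as the example computation are all assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: batch view + speed view, combined at query time. */
final class LambdaSketch {

    /** Batch layer: recompute the view from the full, immutable event log. Rerunning it gives the same result. */
    static Map<String, Long> batchView(List<String> masterLog) {
        Map<String, Long> view = new HashMap<>();
        for (String key : masterLog) view.merge(key, 1L, Long::sum);
        return view;
    }

    /** Speed layer: a transient view over events that arrived after the last batch run. */
    static Map<String, Long> speedView(List<String> recentEvents) {
        Map<String, Long> view = new HashMap<>();
        for (String key : recentEvents) view.merge(key, 1L, Long::sum);
        return view;
    }

    /** Serving layer + combining results: answer a query from both views. */
    static long query(Map<String, Long> batch, Map<String, Long> speed, String key) {
        return batch.getOrDefault(key, 0L) + speed.getOrDefault(key, 0L);
    }
}
```

When the next batch run absorbs the recent events into the master log, the speed view can simply be discarded and rebuilt, which is why it can afford to be transient.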
  77. Pattern Language
      structured method for solving large, complex design problems, where the syntax of the language ensures the use of best practices – i.e., conveying expertise
      [diagram: workflow with nodes labeled Failure Traps, bonus allocation, employee, PMML classifier, quarterly sales, Join, Count, leads]
      A Pattern Language, Christopher Alexander, et al.
      amazon.com/dp/0195019199
  80. Culture
      Notes from the Mystery Machine Bus, Steve Yegge, Google
      goo.gl/SeRZa
      consider these perspectives in light of Conway’s Law…
      “conservatism”                         “liberalism”
      (mostly) Enterprise                    (mostly) Start-Up
      risk management                        customer experiments
      assurance                              flexibility
      well-defined schema                    schema follows code
      explicit configuration                 convention
      type-checking compiler                 interpreted scripts
      wants no surprises                     wants no impediments
      Java, Scala, Clojure, etc.             PHP, Ruby, Python, etc.
      Cascading, Scalding, Cascalog, etc.    Hive, Pig, Hadoop Streaming, etc.
  81. Two Avenues to the App Layer…
      [chart axes: scale ➞, complexity ➞]
      Enterprise: must contend with complexity at scale every day… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff
      Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
  82. What is needed?
      approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log files, etc., generally by socializing the problem
      unfortunately, data-related budgets tend to go into frameworks which can only be used after clean up
      most valuable skills:
      ‣ learn to use programmable tools that prepare data
      ‣ learn to understand the audience and their priorities
      ‣ learn to generate compelling data visualizations
      ‣ learn to estimate the confidence for reported results
      ‣ learn to automate work, making analysis repeatable
      d3js.org
  85. Learning Curves
      difficulties in the commercial use of distributed systems often get represented as issues of managing complexity
      much of the risk in managing a data science team is about budgeting for learning curve: some orgs practice a kind of engineering “conservatism”, with highly structured process and strictly codified practices – people learn a few things well, then avoid having to struggle with learning many new things perpetually…
      that approach leads to enormous teams and low ROI
      [chart axes: scale ➞, complexity ➞]
      ultimately, the challenge is about managing learning curves within a social context
  87. Management
      ultimately, the challenge is about managing learning curves within a social context
      [chart axes: est. cost of individual learning, initial impl; est. cost of team re-learning, lifecycle]
      some technologies constrain the need to learn, others accelerate re-learning prior business logic… choose the latter, FTW!
      IMHO, the “agile” part was intended to be about shared learnings; while the “lean” part was about how much you have on your plate at any one time
  88. Throw Your Life a Curve, Whitney Johnson
      blogs.hbr.org/johnson/2012/09/throw-your-life-a-curve.html
      Aggressively Pro-Active Learning
      • deconstruction of the cognitive bias One Size Fits All
      • “makes a compelling case for personal disruption”
      • “plan your career around learning curves”
      • hire people who learn/re-learn efficiently
  89. Summary
      to be competitive globally with Big Data requires learning many technologies – then learning the nuances of a code base for which the team is responsible, learning the ever-changing surprises and insights which are hidden deep within mountains of data, plus the ever-evolving mathematics needed to grapple with these conditions effectively
      because the data won’t fit on one computer anymore
      [agenda map, marked “you are here”: First Principles, Topologies, Languages, Modeling, Attention, Clusters, Algorithms, Trendlines, Organization, Architecture, Culture, Business, Personality, Learning Curves]
  90. Cascading: Workflow Abstraction
      Design Patterns for Workflows, Across Departments
      [diagram: workflow with nodes labeled Failure Traps, bonus allocation, employee, PMML classifier, quarterly sales, Join, Count, leads]
  91. Anatomy of an Enterprise app
      Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
      [diagram: ETL, data prep, predictive model, data sources, end uses]
  92. Anatomy of an Enterprise app: ANSI SQL for ETL
  93. Anatomy of an Enterprise app: J2EE for business logic
  94. Anatomy of an Enterprise app: SAS for predictive models
  95. Anatomy of an Enterprise app: SAS for predictive models, ANSI SQL for ETL: most of the licensing costs…
  96. Anatomy of an Enterprise app: J2EE for business logic: most of the project costs…
  97. Anatomy of an Enterprise app
      Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source
      [diagram: data sources, ETL, data prep, predictive model, end uses]
      • source taps for Cassandra, JDBC, Splunk, etc.
      • Lingual: DW → ANSI SQL
      • Pattern: SAS, R, etc. → PMML
      • business logic in Java, Clojure, Scala, etc.
      • sink taps for Memcached, HBase, MongoDB, etc.
      a compiler sees it all…
      cascading.org
  98. Anatomy of an Enterprise app: a compiler sees it all…
      Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source
      [diagram: data sources, ETL, data prep, predictive model, end uses; Lingual: DW → ANSI SQL; Pattern: SAS, R, etc. → PMML; business logic in Java, Clojure, Scala, etc.; source taps for Cassandra, JDBC, Splunk, etc.; sink taps for Memcached, HBase, MongoDB, etc.]
      // ETL flow: two source taps and one sink tap, with the assembly planned from a SQL statement
      FlowDef flowDef = FlowDef.flowDef()
        .setName( "etl" )
        .addSource( "example.employee", emplTap )
        .addSource( "example.sales", salesTap )
        .addSink( "results", resultsTap );
      SQLPlanner sqlPlanner = new SQLPlanner()
        .setSql( sqlStatement );
      flowDef.addAssemblyPlanner( sqlPlanner );
      cascading.org
  99. Anatomy of an Enterprise app: a compiler sees it all…
      Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source
      [diagram: data sources, ETL, data prep, predictive model, end uses; Lingual: DW → ANSI SQL; Pattern: SAS, R, etc. → PMML; business logic in Java, Clojure, Scala, etc.; source taps for Cassandra, JDBC, Splunk, etc.; sink taps for Memcached, HBase, MongoDB, etc.]
      // classifier flow: one source tap and one sink tap, with the assembly planned from a PMML model file
      FlowDef flowDef = FlowDef.flowDef()
        .setName( "classifier" )
        .addSource( "input", inputTap )
        .addSink( "classify", classifyTap );
      PMMLPlanner pmmlPlanner = new PMMLPlanner()
        .setPMMLInput( new File( pmmlModel ) )
        .retainOnlyActiveIncomingFields();
      flowDef.addAssemblyPlanner( pmmlPlanner );
  100. Anatomy of an Enterprise app
      Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source
      [diagram: data sources, ETL, data prep, predictive model, end uses; Lingual: DW → ANSI SQL; Pattern: SAS, R, etc. → PMML; business logic in Java, Clojure, Scala, etc.; source taps for Cassandra, JDBC, Splunk, etc.; sink taps for Memcached, HBase, MongoDB, etc.]
      visual collaboration for the business logic is a great way to improve how teams work together:
      [diagram: workflow with nodes labeled Failure Traps, bonus allocation, employee, PMML classifier, quarterly sales, Join, Count, leads]
      Literate Programming, Don Knuth
      www-cs-faculty.stanford.edu/~uno/lp.html
      cascading.org
  101. Anatomy of an Enterprise app
      Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source
      [diagram: data sources, ETL, data prep, predictive model, end uses; Lingual: DW → ANSI SQL; Pattern: SAS, R, etc. → PMML; business logic in Java, Clojure, Scala, etc.; source taps for Cassandra, JDBC, Splunk, etc.; sink taps for Memcached, HBase, MongoDB, etc.]
      visual collaboration for the business logic is a great way to improve how teams work together:
      [diagram: workflow with nodes labeled Failure Traps, bonus allocation, employee, PMML classifier, quarterly sales, Join, Count, leads]
      Literate Programming, Don Knuth
      www-cs-faculty.stanford.edu/~uno/lp.html
      multiple departments, working in their respective frameworks, integrate results into a combined app, which runs at scale on a cluster… business process combined in a common space (DAG) for flow planners, compiler, optimization, troubleshooting, exception handling, notifications, security audit, performance monitoring, etc.
      cascading.org
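To make the “common space (DAG)” idea concrete: once every department’s step is a node in one graph, a planner can check the whole workflow for cycles and derive an execution order before anything runs. The sketch below is not Cascading’s flow planner; it is a generic topological sort (Kahn’s algorithm) with a hypothetical class name (`WorkflowDag`), and the edges in the usage note are an assumed reading of the diagram labels.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: a workflow as a DAG, with an execution order derived by topological sort. */
final class WorkflowDag {
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    /** Declare that step "to" consumes the output of step "from". */
    void addStep(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        edges.computeIfAbsent(to, k -> new ArrayList<>());
    }

    /** Kahn's algorithm: repeatedly emit steps whose inputs are all satisfied. */
    List<String> executionOrder() {
        Map<String, Integer> indegree = new LinkedHashMap<>();
        for (String node : edges.keySet()) indegree.put(node, 0);
        for (List<String> targets : edges.values()) {
            for (String target : targets) indegree.merge(target, 1, Integer::sum);
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : indegree.entrySet()) {
            if (e.getValue() == 0) ready.add(e.getKey());
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            for (String target : edges.get(node)) {
                if (indegree.merge(target, -1, Integer::sum) == 0) ready.add(target);
            }
        }
        if (order.size() != edges.size()) {
            throw new IllegalStateException("workflow contains a cycle");
        }
        return order;
    }
}
```

For example, adding the steps data sources → ETL, ETL → data prep, data prep → predictive model, and predictive model → end uses yields those five steps in that order. A workflow stitched together by hand across separate tools has no single place where this kind of whole-graph check can happen.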
  102. Circa 2013: clusters everywhere – Four-Part Harmony
      [diagram: Use Cases Across Topologies. Labels: Workflow; RDBMS; near time; batch; services; transactions, content; social interactions; Web Apps, Mobile, etc.; History; Data Products; Customers; Log Events; In-Memory Data Grid; Hadoop, etc.; Cluster Scheduler; Prod; Eng; DW; s/w dev; data science; discovery + modeling; Planner; Ops; dashboard metrics; business process; optimized capacity; taps; Data Scientist; App Dev; Ops; Domain Expert; introduced capability; existing SDLC]
  103. Circa 2013: clusters everywhere – Four-Part Harmony: 1. End Use Cases, the drivers
  104. Circa 2013: clusters everywhere – Four-Part Harmony: 2. A new kind of team process
  105. Circa 2013: clusters everywhere – Four-Part Harmony: 3. Abstraction layer as optimizing middleware, e.g., Cascading
  106. Circa 2013: clusters everywhere – Four-Part Harmony: 4. Distributed OS, e.g., Mesos
  107. references…
      Enterprise Data Workflows with Cascading, O’Reilly, 2013
      amazon.com/dp/1449358721
  108. drill-down…
      blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities, newsletter:
      cascading.org
      zest.to/group11
      github.com/Cascading
      conjars.org
      goo.gl/KQtUL
      concurrentinc.com
      bit.ly/pxnnews
