Hadoop and Beyond
 


A talk given in Austin, Texas on 2013-05-28 about how cognitive bias interferes with leveraging distributed systems for large-scale apps. Also, about design patterns for Enterprise data workflows. http://hadoop-and-beyond-austin.eventbrite.com/


Usage Rights

CC Attribution-ShareAlike License

Hadoop and Beyond: Presentation Transcript

  • Paco Nathan, bit.ly/pxnnews, @pacoid: “Hadoop and Beyond”
  • Contents:
    1. Conceptual Map: First Principles, Topologies, Languages, Modeling, Attention, Clusters, Algorithms, Trendlines, Organization, Architecture, Culture, Business, Personality, Learning Curves
    2. Design Patterns [workflow diagram labels: Failure Traps, bonus allocation, employee, PMML classifier, quarterly sales, Join, Count, leads]
  • [Conceptual map, “Beyond Hadoop”: First Principles]
  • First Principles
    we are taught to think of computing resources in terms of Von Neumann architecture
    in other words, we characterize the computing resources by CPU, RAM, I/O
  • First Principles
    back in the day, all the tables required for a given database could fit onto one computer, with one memory space, and one file space
      • okay, maybe the CPU was multi-core…
      • okay, maybe RAM paged out to virtual memory…
      • okay, maybe the disks were in a RAID config…
    or there were extra caches, or separate busses, etc.
    but essentially those were incremental extensions to a Von Neumann architecture… a machine created in his image, if you will
    NB: credit should go to Eckert and Mauchly, inventors of the ENIAC
  • First Principles
    a generation of computer scientists has been taught to think “relational” – data on a DB server
    RDBMS made sense, with their indexes, b-trees, normal forms, etc.
    Q: need to query bigger data?
    A: simple, buy or lease a bigger DB server
    however, that all changed…
    some of the issues encountered in large-scale data teams are, to put it politely, obscure
    starting from first principles, let’s explore a map of some important points to consider
  • [Conceptual map: Topologies]
  • Topologies
    largely due to the rapid rise of machine data, circa late 1990s, we use distributed systems
    because the data won’t fit on one computer anymore
    AMZN, EBAY, YHOO, GOOG leveraged horizontal scale-out, based on commodity hardware
    practices at LinkedIn, Apple, Facebook, Twitter, etc., followed from those early successes
    algorithmic modeling, applied to the aggregation of machine data, allowed for Big Data to become monetized
    a feedback loop evolved – refining aggregate social interactions into data products, which in turn made web apps become more intelligent
  • [Diagram, “Circa 2001: post- big e-commerce successes”, highlighting “data products”. Labels: Customers, Web Apps, customer transactions, Middleware, servlets, models, RDBMS, SQL Query, result sets, Logs, event history, DW, ETL, aggregation, dashboards, Algorithmic Modeling, recommenders + classifiers, Stakeholder, Product, Engineering, UX]
  • Topologies
    Hadoop and other topologies arose from a need for fault-tolerant workloads, leveraging horizontal scale-out based on commodity hardware
    because the data won’t fit on one computer anymore
    a variety of Big Data technologies has since emerged, which can be categorized in terms of topologies and the CAP Theorem
  • Hadoop, as a topology (image credits: Apache, Wikipedia)
    components which implement MapReduce:
      • name node / data node
      • job tracker / task tracker
      • submit queue
      • task slots
      • distributed cache
      • HDFS
  • Some Other Topologies…
      • Spark (iterative/interactive)
      • Titan (graph database)
      • Redis (in-memory data grid)
      • Zookeeper (distributed metadata)
      • HBase (columnar data objects)
      • Cassandra (key-value store)
      • Storm (real-time streams)
      • ElasticSearch (search index)
      • MongoDB (document store)
      • Greenplum (MPP)
      • SciDB (array database)
  • CAP Theorem
    “You can have at most two of these properties for any shared-data system… the choice of which feature to discard determines the nature of your system.” – Eric Brewer, 2000 (Inktomi/YHOO)
    C: strong consistency
    A: high availability
    P: partition tolerance
    [the diagram also labels “eventual consistency”]
    cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
    julianbrowne.com/article/viewer/brewers-cap-theorem
  • Access → Frameworks → CAP Theorem Forfeits (x marks the forfeited property)
    Access                              Frameworks                 C  A  P
    financial transactions              general ledger in RDBMS    C  A  x
    ad-hoc queries                      RDS (hosted MySQL)         C  A  x
    reporting, dashboards               like Pentaho               C  A  x
    log rotation/persistence            like Riak, Cassandra       x  x  P
    search indexes                      like ElasticSearch, Solr   x  A  P
    static content, archives            S3 (durable storage)       x  A  P
    key/value data objects              like HBase                 C  x  P
    data prep, ETL, modeling at scale   like Hadoop/Cascading      C  x  P
    graph queries                       like Titan                 C  x  P
  • [Diagram, “Circa 2013: clusters everywhere”, highlighting “optimize topologies”: Use Cases Across Topologies. Labels: Customers; Web Apps, Mobile, etc.; transactions, content; social interactions; services; RDBMS; Log Events; History; In-Memory Data Grid; Hadoop, etc.; Cluster Scheduler; near time; batch; Workflow; Planner; taps; Data Products; DW; Prod; Eng; Ops; dashboard metrics; business process; optimized capacity; s/w dev; data science; discovery + modeling. Roles: Data Scientist, App Dev, Ops, Domain Expert. Legend: introduced capability, existing SDLC]
  • In their own words…
    Amazon: “Early Amazon: Splitting the website” – Greg Linden
      glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
    eBay: “The eBay Architecture” – Randy Shoup, Dan Pritchett
      addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
      addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
    Inktomi (YHOO Search): “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
      youtu.be/E91oEn1bnXM
    Google: “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
      youtu.be/qsan-GQaeyk
      perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
    MIT Media Lab: “Social Information Filtering for Music Recommendation” – Pattie Maes
      pubs.media.mit.edu/pubs/papers/32paper.ps
      ted.com/speakers/pattie_maes.html
  • [Conceptual map: Modeling]
  • Modeling
    back in the day, we worked with practices based on data modeling
      1. sample the data
      2. fit the sample to a known distribution
      3. ignore the rest of the data
      4. infer, based on that fitted distribution
    that served well with ONE computer, ONE analyst, ONE model… just throw away annoying “extra” data
    circa late 1990s: machine data, aggregation, clusters, etc.
    algorithmic modeling displaced data modeling
    because the data won’t fit on one computer anymore
  • Two Cultures
    “A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.”
    Statistical Modeling: The Two Cultures, Leo Breiman, 2001, bit.ly/eUTh9L
    in other words, seeing the forest for the trees…
    this paper chronicled a sea change from data modeling practices (silos, manual process) to the rising use of algorithmic modeling (machine data for automation/optimization)
  • Algorithmic Modeling
    “The trick to being a scientist is to be open to using a wide variety of tools.” – Breiman
    circa 2001: Random Forest, bootstrap aggregation, etc., yield dramatic increases in predictive power over earlier modeling such as Logistic Regression
    major learnings from the Netflix Prize: the power of ensembles, model chaining, etc.
    the problems at hand have become simply too big and too complex for ONE distribution, ONE model, ONE team…
    stanford.edu/~lmackey/papers/netflix_story-nas11-slides.pdf
  • [Conceptual map: Attention]
  • Attention
    impromptu survey:
      • how many people say they practice some kind of “Agile” process at work?
      • how many people say that they DON’T practice “Agile”?
      • how many people say they are in a lean startup?
    Q: with respect to Big Data practices, how is that working out?
    Abby Fichtner, vimeo.com/27797408
  • Agile Data?
    some people see a reconciliation of Agile process and Big Data…
    Agile Data, Russell Jurney, 2013, amazon.com/dp/1449326269
    “Run like a studio, not an assembly line.”
  • Perhaps Not
    great values, wrong domain… that worked when we were building features in web apps
    Agile represents industrialization of software engineering, codifying social interactions, compartmentalizing attention
    meanwhile, Data Science is inherently multi-disciplinary:
      • teams of people with complementary skill sets
      • actionable insights require weeks/months, not hours
      • variance and statistical thinking are foreign to CS
    LinkedIn-style problems circa 2011 required certain skills… manipulating the Newtonian physics of data… that money may be mostly off the table by now
    Big Data opportunities ahead require different math?
  • [Conceptual map: Business]
  • Business Disruption
    Geoffrey Moore, Mohr Davidow Ventures, author of Crossing The Chasm / Hadoop Summit, 2012:
      what Amazon did to the retail sector… has put the entire Global 1000 on notice over the next decade… data as the major force… mostly through apps – verticals, leveraging domain expertise
    Michael Stonebraker, INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc. / XLDB, 2012:
      complex analytics workloads are now displacing SQL as the basis for Enterprise apps
    Larry Page, CEO, Google / Wired, 2013:
      create products and services that are 10 times better than the competition… thousand-percent improvement requires rethinking problems entirely, exploring the edges of what’s technically possible, and having a lot more fun in the process
  • Business Drivers
    algorithmic modeling + machine data + curation, metadata + Open Data
    data products, as feedback into automation
    evolution of feedback loops
    less about “bigness”, more about complexity
    internet of things + A/D conversion + complex analytics
    accelerated evolution, additional feedback loops
    orders of magnitude higher data rates
    Internet of Things accelerates this process of disruption
    “A kind of Cambrian explosion” (source: National Geographic)
  • Internet of Things
  • A Thought Exercise
    consider that when a company like Caterpillar moves into data science, they won’t be building the world’s next search engine or social network
    they will most likely be optimizing supply chain, optimizing fuel costs, automating data feedback loops integrated into their equipment…
    that’s a $50B company, in a market segment worth $250B
    upcoming: tractors as drones – guided by complex, distributed data apps
    Operations Research – crunching amazing amounts of data
  • Alternatively… climate.com
  • [Conceptual map: Algorithms]
  • Algorithms
    many algorithm libraries used today are based on implementations back when people used DO loops in FORTRAN, 30+ years ago
    MapReduce is Good Enough? Jimmy Lin, UMD, umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
    astrophysics and genomics are light years ahead in sophisticated algorithms work – as Breiman suggested in 2001 – which may take a while to percolate into industry
    other game-changers:
      • streaming algorithms, sketches, probabilistic data structures
      • significant “Big O” complexity reduction (e.g., skytree.net)
      • better architectures and topologies (e.g., GPUs and CUDA)
      • partial aggregates – parallelizing workflows
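The “partial aggregates” bullet can be illustrated with a minimal sketch in plain Java. This is not Hadoop or Cascading API, and the class and method names (`PartialAggregates`, `Partial`, `summarize`, `globalMean`) are hypothetical. Each partition is reduced to a small (count, sum) pair, and because merging those pairs is associative, the partitions can be processed in parallel and combined in any order.

```java
import java.util.Arrays;
import java.util.List;

public class PartialAggregates {
    /** A mergeable partial aggregate for computing a mean: (count, sum). */
    public static final class Partial {
        final long count;
        final double sum;

        Partial(long count, double sum) {
            this.count = count;
            this.sum = sum;
        }

        /** Merging is associative and commutative, so partials may combine in any order. */
        Partial merge(Partial other) {
            return new Partial(count + other.count, sum + other.sum);
        }

        double mean() {
            return count == 0 ? 0.0 : sum / count;
        }
    }

    /** "Map side": summarize one partition of the data independently. */
    static Partial summarize(double[] partition) {
        return new Partial(partition.length, Arrays.stream(partition).sum());
    }

    /** "Reduce side": merge the per-partition partials into a global mean. */
    public static double globalMean(List<double[]> partitions) {
        return partitions.parallelStream()
                .map(PartialAggregates::summarize)
                .reduce(new Partial(0, 0.0), Partial::merge)
                .mean();
    }

    public static void main(String[] args) {
        // Hypothetical data: three partitions holding six values in total.
        List<double[]> partitions = Arrays.asList(
                new double[] {1.0, 2.0, 3.0},
                new double[] {4.0, 5.0},
                new double[] {6.0});
        System.out.println(globalMean(partitions));
    }
}
```

Only the small partials move between workers, never the raw rows, which is why this pattern tends to suit clustered workflows.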
  • How much does it cost you to earn $1B?
    also, take a moment to check this out… (IMHO most interesting algorithm work recently)
    QR factorization of a “tall-and-skinny” matrix
      • used to solve many data problems at scale, e.g., PCA, SVD, etc.
      • numerically stable with efficient implementation on large-scale Hadoop clusters
    suppose that you have a sparse matrix of customer interactions where there are 100MM customers, with a limited set of outcomes…
    cs.purdue.edu/homes/dgleich
    stanford.edu/~arbenson
    github.com/ccsevers/scalding-linalg
    David Gleich, slideshare.net/dgleich
    Tristan Jehan
    distributed algorithms for high ROI use cases on cost-effective clustered resources… we’re learning how to do it right
  • [Conceptual map: Personality]
  • Personality
    we have perhaps built computers (once named “electronic brains”) in the image of John Von Neumann, et al.: standalone genius, aristotelian uber-geek, incredible capacity for memory and logic, overbearing, not particularly cooperative…
    one can almost imagine a war-time dialogue, “Get one of these guys in the room, they’ll solve anything!” … as a result, decades of mutually assured destruction for global strategy
    Q: have we created software engineering practices which selected for this kind of personality? selecting for “lone wolf” guys, socially awkward, ONE person who can understand an entire code base, able to out-logic and out-argue the rest of the room… charming fellow, really
    have we enabled software process to box these personalities into something resembling teams? along with overtly described rules for social conventions… silos, in other words
  • Chasing Unicorns
    silos… but didn’t that all change?
    because the data won’t fit on one computer anymore
    leverage with data science teams is where organizations tear down internal silos, socializing hard problems
    data won’t fit on one computer anymore, problems won’t fit in one department anymore, the code base won’t fit into one uber-geek’s memory recall anymore…
    so we embrace distributed systems for solutions
    Q: “Why aren’t there more women in engineering?”
    IMHO, we’re trying to select for a personality which doesn’t exist, and would not resolve current challenges; meanwhile, my data science teams run about 50/50
  • [Conceptual map: Clusters]
  • Clusters
    a little secret: people like me make a good living by leveraging high ROI apps based on clusters, and so the execs agree to build out more data centers…
    clusters for Hadoop/Hive/HBase, clusters for Memcached, for Cassandra, for MySQL, for Storm, for Nginx, etc.
    this becomes expensive!
    a single class of workloads on a given cluster is simpler to manage; but terrible for utilization
    leveraging VMs and various notions of “cloud” helps
    Cloudera, Hortonworks, probably EMC soon: sell a notion of “Hadoop as OS”: All your workloads are belong to us
    regardless of how architectures change, death and taxes will endure: servers fail, and data must move
    [photo: Google Data Center, Fox News, ~2002]
  • Operating Systems, redux
    meanwhile, GOOG is 3+ generations ahead, with much improved ROI on data centers
    John Wilkes, et al., Borg/Omega: “10x” secret sauce, youtu.be/0ZFMlO98Jkc
    [charts, 0% to 100% over time t: Rails CPU load; Memcached CPU load; Hadoop CPU load; combined CPU load (Rails, Memcached, Hadoop)]
    Florian Leibert, Chronos/Mesos @ Airbnb
    Mesos, open source cloud OS – like Borg, incubator.apache.org/mesos
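The utilization argument behind the combined CPU load chart can be sketched as arithmetic. This assumes the chart's point is that workloads peak at different times, which the transcript does not state outright. The class name `SharedClusterSizing` and the load figures in the usage note are hypothetical, and nothing here is Mesos or Borg API. Separate clusters must each be sized for their own peak, while a shared cluster only needs to cover the peak of the summed load.

```java
public class SharedClusterSizing {
    /** Capacity needed when each workload gets its own cluster: the sum of the individual peaks. */
    public static double separateCapacity(double[][] loads) {
        double total = 0.0;
        for (double[] series : loads) {
            double peak = 0.0;
            for (double v : series) {
                peak = Math.max(peak, v);
            }
            total += peak;
        }
        return total;
    }

    /**
     * Capacity needed on one shared cluster: the peak of the summed load.
     * Assumes every series covers the same time steps.
     */
    public static double sharedCapacity(double[][] loads) {
        if (loads.length == 0) {
            return 0.0;
        }
        double peak = 0.0;
        for (int t = 0; t < loads[0].length; t++) {
            double sum = 0.0;
            for (double[] series : loads) {
                sum += series[t];
            }
            peak = Math.max(peak, sum);
        }
        return peak;
    }
}
```

With made-up loads of {80, 20, 20} for Rails, {20, 60, 20} for Memcached, and {10, 10, 90} for Hadoop over three time steps, separate clusters need 80 + 60 + 90 = 230 units, while a shared cluster needs max(110, 90, 130) = 130.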
  • [Conceptual map: Trendlines]
  • Trendlines
    Big Data? we’re just getting started:
      • ~12 exabytes/day, jet turbines on commercial flights
      • Google self-driving cars, ~1 Gb/s per vehicle
      • National Instruments initiative: Big Analog Data™
      • 1m resolution satellites, skyboximaging.com
      • open resource monitoring, reddmetrics.com
      • Sensing XChallenge, nokiasensingxchallenge.org
    consider the implications of Jawbone, Nike, etc., plus the secondary/tertiary effects of Google Glass
    7+ billion people, instrumented better than … how we have Nagios instrumenting our web servers right now
    plus the business implications given that much of the Global 1000 is positioned to be disrupted
    technologyreview.com/...
  • Three Laws, or more?
    meanwhile, architectures evolve toward much, much larger data…
    pistoncloud.com/ ...
    Rich Freitas, IBM Research
    Q: what kinds of evolution in topologies could this imply?
  • [Conceptual map: Languages]
  • Languages
    JVM-based languages became popular for Big Data open source technologies:
      • partly because YHOO adopted Hadoop, etc.
      • partly because Enterprise IT shops have J2EE expertise
      • partly because of functional languages: Clojure, Scala
    JVM has its drawbacks, especially for low-latency use cases
    ample use of languages such as Python and Erlang in Big Data practices, plus keep in mind that Google uses C++
    Functional Thinking, Neal Ford, youtu.be/plSZIkLodDM
    a hunch: issues about current programming languages are secondary to culture
  • Functional Programming for Big Data
    WordCount with token scrubbing…
    Apache Hive: 52 lines HQL + 8 lines Python (UDF)
    compared to
    Scalding: 18 lines Scala/Cascading
    functional programming languages help reduce software engineering costs at scale, over time
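The slide gives only line counts, not code. For a sense of what “WordCount with token scrubbing” computes, here is a minimal single-machine sketch using plain Java streams. It is not the Hive or Scalding program being counted, the class name `WordCount` is hypothetical, and the scrubbing rule (lowercase, then strip everything except ASCII letters and digits) is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordCount {
    /** Token scrubbing (assumed rule): lowercase, then drop anything except a-z and 0-9. */
    static String scrub(String token) {
        return token.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    /** Split lines on whitespace, scrub each token, and count the non-empty results. */
    public static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .map(WordCount::scrub)
                .filter(token -> !token.isEmpty())
                .collect(Collectors.groupingBy(token -> token, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical input; punctuation and case differences are scrubbed away.
        List<String> lines = Arrays.asList("Big data, BIG clusters.", "big Data!");
        System.out.println(count(lines)); // prints {big=3, clusters=1, data=2}
    }
}
```

The pipeline shape (split, scrub, filter, group, count) is the same composition of small functions that the slide credits for the shorter Scalding version.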
  • references…
    “Scalable and Flexible Machine Learning With Scala @ LinkedIn”, Vitaly Gordon [especially see slide #9]
      slideshare.net/VitalyGordon/scalable-and-flexible-machine-learning-with-scala-linkedin
    Elements Of Functional Programming, Chris Reade
      amazon.com/dp/0201129159
  • [Conceptual map: Organization]
  • Organization
    How Do Committees Invent?, Melvin Conway, 1968, melconway.com/research/committees.html
    [cartoon: Manu Cornet, bonkersworld.net]
    “Any organization that designs a system (defined more broadly here than just information systems) will inevitably produce a design whose structure is a copy of the organization’s communication structure.”
    Q:
      • does this fit with software process?
      • does this fit with distributed apps?
    see also: haacked.com/archive/2013/05/13/applying-conways-law.aspx
  • Cooperation
    perhaps we have selected for the wrong personality to idealize…
    linkedin.com/today/post/article/20130520190305-110300724-why-nothing-not-even-software-can-eat-the-world
    All long-term success depends on eliciting the voluntary support of an ecosystem. As the African proverb says, “If you want to go fast, go alone; if you want to go far, go with others.” – Geoffrey Moore
  • [Diagram, “Team Composition: Needs × Roles”. Needs: discovery, modeling, integration, apps, systems. Labels: business process, stakeholder; data prep, discovery, modeling, etc.; software engineering, automation; systems engineering, access; data science. Roles: Data Scientist, App Dev, Ops, Domain Expert. Legend: introduced]
  • [Conceptual map: Architecture]
  • Architecture
    Rich Hickey, Nathan Marz, Stuart Sierra, et al.: functional programming to help reduce costs over time
      1. technical debt? this is how an organization builds a culture to avoid it
      2. Conway’s Law corollary: model teams and communication based on properties of the desired architecture
      3. also consider Mesos/Borg: schedule data to be located where [CPU, RAM, I/O, surety] will become available
    Rich Hickey, infoq.com/presentations/Simple-Made-Easy
  • Lambda Architecture
    Big Data, Nathan Marz, James Warren, manning.com/marz
      • batch layer (immutable data, idempotent ops)
      • serving layer (to query batch)
      • speed layer (transient, cached “real-time”)
      • combining results
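How the layers listed above fit together at query time can be shown with a toy, in-memory sketch. The class `LambdaQuery` and its methods are hypothetical and not taken from the book, and the sketch assumes for simplicity that each batch run covers every event seen so far, so the speed view can be discarded whole.

```java
import java.util.HashMap;
import java.util.Map;

public class LambdaQuery {
    /** Serving layer: counts precomputed by the batch layer from the immutable master dataset. */
    private final Map<String, Long> batchView = new HashMap<>();

    /** Speed layer: transient counts for events that arrived after the last batch run. */
    private final Map<String, Long> speedView = new HashMap<>();

    /** New events update only the speed view; the next batch run will absorb them. */
    public void onEvent(String key) {
        speedView.merge(key, 1L, Long::sum);
    }

    /**
     * A finished batch run replaces the batch view wholesale. Under the stated
     * assumption that the run covered all events so far, the speed view is reset.
     */
    public void onBatchComplete(Map<String, Long> recomputed) {
        batchView.clear();
        batchView.putAll(recomputed);
        speedView.clear();
    }

    /** Combining results: a query sums the batch view and the speed view. */
    public long query(String key) {
        return batchView.getOrDefault(key, 0L) + speedView.getOrDefault(key, 0L);
    }
}
```

Because the batch view is always recomputed from immutable data, an error in the speed layer is wiped out by the next batch run rather than accumulating.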
  • Pattern Language
    structured method for solving large, complex design problems, where the syntax of the language ensures the use of best practices – i.e., conveying expertise
    [workflow diagram labels: Failure Traps, bonus allocation, employee, PMML classifier, quarterly sales, Join, Count, leads]
    A Pattern Language, Christopher Alexander, et al., amazon.com/dp/0195019199
  • [Conceptual map: Culture]
  • Culture
    Notes from the Mystery Machine Bus, Steve Yegge, Google, goo.gl/SeRZa
    consider these perspectives in light of Conway’s Law…
    “conservatism”                         “liberalism”
    (mostly) Enterprise                    (mostly) Start-Up
    risk management                        customer experiments
    assurance                              flexibility
    well-defined schema                    schema follows code
    explicit configuration                 convention
    type-checking compiler                 interpreted scripts
    wants no surprises                     wants no impediments
    Java, Scala, Clojure, etc.             PHP, Ruby, Python, etc.
    Cascading, Scalding, Cascalog, etc.    Hive, Pig, Hadoop Streaming, etc.
  • Two Avenues to the App Layer…
    [chart axes: scale ➞, complexity ➞]
    Enterprise: must contend with complexity at scale every day…
      incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff
    Start-ups: crave complexity and scale to become viable…
      new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
  • What is needed?
    approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log files, etc., generally by socializing the problem
    unfortunately, data-related budgets tend to go into frameworks which can only be used after clean up
    most valuable skills:
      ‣ learn to use programmable tools that prepare data
      ‣ learn to understand the audience and their priorities
      ‣ learn to generate compelling data visualizations
      ‣ learn to estimate the confidence for reported results
      ‣ learn to automate work, making analysis repeatable
    d3js.org
  • [Conceptual map: Learning Curves]
  • Learning Curves
    difficulties in the commercial use of distributed systems often get represented as issues of managing complexity
    much of the risk in managing a data science team is about budgeting for learning curve: some orgs practice a kind of engineering “conservatism”, with highly structured process and strictly codified practices – people learn a few things well, then avoid having to struggle with learning many new things perpetually…
    that approach leads to enormous teams and low ROI
    [chart axes: scale ➞, complexity ➞]
    ultimately, the challenge is about managing learning curves within a social context
  • Management
    ultimately, the challenge is about managing learning curves within a social context
    [chart axes: est. cost of individual learning, initial impl; est. cost of team re-learning, lifecycle]
    some technologies constrain the need to learn, others accelerate re-learning prior business logic… choose the latter, FTW!
    IMHO, the “agile” part was intended to be about shared learnings; while the “lean” part was about how much you have on your plate at any one time
  • Aggressively Pro-Active Learning
    Throw Your Life a Curve, Whitney Johnson, blogs.hbr.org/johnson/2012/09/throw-your-life-a-curve.html
      • deconstruction of the cognitive bias One Size Fits All
      • “makes a compelling case for personal disruption”
      • “plan your career around learning curves”
      • hire people who learn/re-learn efficiently
  • Summary
    to be competitive globally with Big Data requires learning many technologies – then learning the nuances of a code base for which the team is responsible, learning the ever-changing surprises and insights which are hidden deep within mountains of data, plus the ever-evolving mathematics needed to grapple with these conditions effectively
    because the data won’t fit on one computer anymore
    [conceptual map: you are here]
  • Cascading: Workflow Abstraction
    Design Patterns for Workflows, Across Departments
    [workflow diagram labels: Failure Traps, bonus allocation, employee, PMML classifier, quarterly sales, Join, Count, leads]
  • Anatomy of an Enterprise app
    Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
    [workflow diagram: data sources, ETL, data prep, predictive model, end uses]
      • ANSI SQL for ETL
      • J2EE for business logic
      • SAS for predictive models
      • SAS for predictive models, ANSI SQL for ETL: most of the licensing costs…
      • J2EE for business logic: most of the project costs…
  • Anatomy of an Enterprise app: Cascading allows multiple departments to combine their workflow components into an integrated app (one among many, typically) based on 100% open source; a compiler sees it all… [diagram labels: data sources, ETL, data prep, predictive model, end uses; source taps for Cassandra, JDBC, Splunk, etc.; Lingual: DW → ANSI SQL; business logic in Java, Clojure, Scala, etc.; Pattern: SAS, R, etc. → PMML; sink taps for Memcached, HBase, MongoDB, etc.] cascading.org 97
  • Anatomy of an Enterprise app, Lingual (DW → ANSI SQL): a compiler sees it all… FlowDef flowDef = FlowDef.flowDef().setName( "etl" ).addSource( "example.employee", emplTap ).addSource( "example.sales", salesTap ).addSink( "results", resultsTap ); SQLPlanner sqlPlanner = new SQLPlanner().setSql( sqlStatement ); flowDef.addAssemblyPlanner( sqlPlanner ); cascading.org 98
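The Lingual snippet on the slide above passes a `sqlStatement` that the slide never shows. As a rough sketch only, here is what such a statement might look like in plain Java, assuming that sources registered as `example.employee` and `example.sales` are addressed as tables `employee` and `sales` in schema `example`. The class name `EtlQuery` and the column names `EMPID`, `NAME`, and `AMOUNT` are hypothetical and not taken from the talk.

```java
/**
 * Hypothetical query text for the "etl" flow on the slide.
 * Table names follow the addSource() names; column names are assumed.
 */
class EtlQuery {
    /** Builds the SQL string that would be handed to SQLPlanner.setSql(). */
    static String sqlStatement() {
        return String.join("\n",
            "select e.\"NAME\", s.\"AMOUNT\"",
            "from \"example\".\"employee\" as e",
            "join \"example\".\"sales\" as s",
            "on e.\"EMPID\" = s.\"EMPID\"");
    }

    public static void main(String[] args) {
        // Print the statement so it can be inspected before wiring it into a flow.
        System.out.println(sqlStatement());
    }
}
```

The point of the sketch is the naming link: the ETL department writes ordinary SQL against table names, while the taps behind those names are supplied separately when the flow is defined.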
  • Anatomy of an Enterprise app, Pattern (SAS, R, etc. → PMML): a compiler sees it all… FlowDef flowDef = FlowDef.flowDef().setName( "classifier" ).addSource( "input", inputTap ).addSink( "classify", classifyTap ); PMMLPlanner pmmlPlanner = new PMMLPlanner().setPMMLInput( new File( pmmlModel ) ).retainOnlyActiveIncomingFields(); flowDef.addAssemblyPlanner( pmmlPlanner ); 99
  • Anatomy of an Enterprise app: visual collaboration for the business logic is a great way to improve how teams work together; see Literate Programming, Don Knuth, www-cs-faculty.stanford.edu/~uno/lp.html [flow diagram labels: employee, quarterly sales, leads, Join, Count, PMML classifier, bonus allocation, Failure Traps] cascading.org 100
  • Anatomy of an Enterprise app: multiple departments, working in their respective frameworks, integrate results into a combined app, which runs at scale on a cluster… business process combined in a common space (DAG) for flow planners, compiler, optimization, troubleshooting, exception handling, notifications, security audit, performance monitoring, etc. cascading.org 101
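To make the "common space (DAG)" idea concrete, here is a minimal plain-Java sketch of a workflow graph that a planner could order before running it. It does not use the Cascading API; the class `WorkflowDag`, its methods, and the chain of edges between the stages are assumptions made for illustration, with only the stage names taken from the slides.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy workflow graph: each step maps to the steps that consume its output. */
class WorkflowDag {
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    /** Declares that step `to` consumes the output of step `from`. */
    WorkflowDag connect(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        edges.computeIfAbsent(to, k -> new ArrayList<>());
        return this;
    }

    /** Kahn's algorithm: returns the steps so that each one follows all of its inputs. */
    List<String> executionOrder() {
        // Count how many unfinished inputs each step is still waiting on.
        Map<String, Integer> pending = new LinkedHashMap<>();
        for (String step : edges.keySet()) pending.put(step, 0);
        for (List<String> targets : edges.values())
            for (String target : targets) pending.merge(target, 1, Integer::sum);

        // Steps with no inputs can start immediately.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : pending.entrySet())
            if (entry.getValue() == 0) ready.add(entry.getKey());

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String step = ready.remove();
            order.add(step);
            for (String next : edges.get(step))
                if (pending.merge(next, -1, Integer::sum) == 0) ready.add(next);
        }
        // Any step left unscheduled sits on a cycle, so the graph is not a DAG.
        if (order.size() != edges.size())
            throw new IllegalStateException("cycle detected: not a DAG");
        return order;
    }

    /** Assumed wiring of the stages named on the slides, as a simple chain. */
    static WorkflowDag enterpriseApp() {
        return new WorkflowDag()
            .connect("data sources", "ETL")
            .connect("ETL", "data prep")
            .connect("data prep", "predictive model")
            .connect("predictive model", "end uses");
    }

    public static void main(String[] args) {
        System.out.println(enterpriseApp().executionOrder());
    }
}
```

Because every department's contribution lands in one graph, a single pass like `executionOrder()` can see the whole app at once, which is the precondition for the planning, optimization, and monitoring the slide lists.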
  • Circa 2013: clusters everywhere, Four-Part Harmony [diagram labels: Use Cases Across Topologies; Customers; Web Apps, Mobile, etc.; transactions, content, social interactions; Log Events; RDBMS; History; DW; In-Memory Data Grid; Hadoop, etc.; Cluster Scheduler; services, near time, batch; Workflow; Planner; taps; Data Products; Domain Expert, Data Scientist, App Dev, Ops; s/w dev, data science, discovery + modeling; dashboard metrics; business process; optimized capacity; introduced capability; existing SDLC; Prod; Eng] 102
  • Circa 2013: clusters everywhere, Four-Part Harmony: 1. End Use Cases, the drivers 103
  • Circa 2013: clusters everywhere, Four-Part Harmony: 2. A new kind of team process 104
  • Circa 2013: clusters everywhere, Four-Part Harmony: 3. Abstraction layer as optimizing middleware, e.g., Cascading 105
  • Circa 2013: clusters everywhere, Four-Part Harmony: 4. Distributed OS, e.g., Mesos 106
  • references… Enterprise Data Workflows with Cascading, O’Reilly, 2013, amazon.com/dp/1449358721 107
  • drill-down… blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities, newsletter: cascading.org, zest.to/group11, github.com/Cascading, conjars.org, goo.gl/KQtUL, concurrentinc.com, bit.ly/pxnnews 108