Intro to CascadingPaco Nathan                            Document                            Collection                   ...
Intro to Cascading            Document            Collection                                         Scrub                ...
Cascading API: purpose  ‣ simplify data processing development and deployment  ‣ improve application developer productivit...
Cascading API: a few facts  Java open source project (ASL 2) using Git, Gradle, Maven, JUnit, etc.  in production (~5 yrs)...
Cascading API: a few quotes “Cascading gives Java developers the ability to build Big Data applications  on Hadoop using t...
data+code “political spectrum” “Notes from the Mystery Machine Bus” by Steve Yegge, Google goo.gl/SeRZa          “conserva...
Cascading API: adoption    As Enterprise apps move into    Hadoop and related BigData    frameworks, risk profiles shift   ...
enterprise data workflows Tuples, Pipelines, Endpoints, Operations, Joins, Assertions, Traps, etc. …in other words, “plumb...
data workflows: team  ‣ Business Stakeholder POV:    business process management for workflow orchestration (think BPM/BPEL...
data workflows: layers    business    domain expertise, business trade-offs,    process     operating parameters, market p...
data workflows: SQL         Relational           SQL parser           logical plan,     optimized based on stats          ...
data workflows: SQL vs. JVM         Relational             Cascading + Driven           SQL parser             SQL-92 comp...
Intro to Cascading             Document             Collection                                          Scrub             ...
inflection point huge Internet successes after 1997 holiday season…          1997 AMZN, EBAY, Inktomi (YHOO Search), then G...
inflection point: consequences Geoffrey Moore (Mohr Davidow Ventures, author of Crossing The Chasm) Hadoop Summit, 2012: “A...
primary sources Amazon “Early Amazon: Splitting the website” – Greg Linden glinden.blogspot.com/2006/02/early-amazon-split...
the world before… BI, SQL, and highly optimized code
data innovation: circa 1996                            Stakeholder                   Customers     Excel pivot tables   Po...
the world after… machine learning, leveraging log files
data innovation: circa 2001    Stakeholder                    Product                   Customers      dashboards         ...
the world ahead… what our customers are doing now
data innovation: circa 2013                                                                                             Cu...
a key difference…
statistical thinking       Process               Variation                 Data              Tools  employing a mode of th...
reference  by Leo Breiman  Statistical Modeling:  The Two Cultures  Statistical Science, 2001  bit.ly/eUTh9L  also check o...
Intro to Cascading             Document             Collection                                          Scrub             ...
MapReduce architecture ‣ name node + data nodes ‣ job tracker + task trackers ‣ submit queue ‣ task slots ‣ HDFS ‣ distrib...
MapReduce: how it works   map(k1, v1) → list(k2, v2)   reduce(k2, list(v2)) → list(k3, v3) the property of data independen...
a brief history… circa 1979 – Stanford, MIT, CMU, etc.  set/list operations in LISP, Prolog, etc., for parallel processing...
CAP theorem purpose: theoretical limits for data access patterns essence:    ‣ consistency    ‣ availability    ‣ partitio...
data access patterns because the world is not made of data warehouses… a handful of common data access patterns are preval...
access → frameworks → forfeits  financial transactions               general ledger in RDBMS            CAx  ad-hoc queries...
access → frameworks → forfeits  financial transactions               general ledger in RDBMS            CAx  ad-hoc queries...
parallel computation parallelism allows for horizontal scale-out, which create business “levers” in cost/performance at sc...
reference  by Tom White  Hadoop:The Definitive Guide  O’Reilly, 2009  amazon.com/dp/1449311520  see also:  Cluster Computin...
Intro to Cascading            Document            Collection                                         Scrub                ...
core values  Data Science teams develop actionable insights,  building confidence for decisions  that work may influence a f...
most valuable skills approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cl...
the science in data science?                                                         edoMpUsserD:IUN                      ...
team process = needs                  help people ask the    discovery     right questions                  allow automati...
team composition = roles       Domain       Expert                               business process,                        ...
matrix = needs × roles                                            nn         o         overy           very      elliing  ...
matrix: example team                                             nn          o          overy            very      elliing...
typical hand-offs            integrity                      availability              discovery            communications ...
use case: marketing funnel  •   must optimize a very large ad spend  •   different vendors report different metrics       ...
use case: ecommerce fraud  • sparse data means lots of missing values                                                     ...
use case: customer segmentation  • many millions of customers, hard to determine      which features resonate             ...
use case: monetizing content  • need to suggest relevant content which would                                              ...
reference  by DJ Patil  Data Jujitsu  O’Reilly, 2012  amazon.com/dp/B008HMN5BE  Building Data Science Teams  O’Reilly, 201...
Intro to Cascading             Document             Collection                                          Scrub             ...
“Cascading for the Impatient”  cascading.org/category/impatient/  ‣ a series of introductory tutorials and code samples  ‣...
1: copy                       public class                         Main                         {                         ...
wait!  ten lines of code  for a file copy…  seems like a lot.
same JAR, any scale…                                                            MegaCorp Enterprise IT:                   ...
2: word countDocumentCollection                Tokenize                           GroupBy        M                   token...
Cascading / Java                                                                             DocumentString docPath = args...
Scalding / Scala                                               Document                                               Coll...
Cascalog / Clojure                                                      Document                                          ...
Hive                                                   Document                                                   Collecti...
Pig                                                   Document                                                   Collectio...
3: wc + scrubDocumentCollection                        Scrub   GroupBy             Tokenize                        token  ...
4: wc + scrub + stop words  Document  Collection                               Scrub               Tokenize               ...
5: tf-idf                                                                        Unique                 Insert   SumBy    ...
6: tf-idf + tdd                                                                                                Unique     ...
deployed on AWS… elastic-mapreduce --create --name "TF-IDF"    --jar s3n://temp.cascading.org/impatient/part6.jar    --arg...
results?                                                                                                                  ...
comparisons? compare similar code in Scalding (Scala) and Cascalog (Clojure): sujitpal.blogspot.com/2012/08/scalding-for-i...
Intro to Cascading              Document              Collection                                           Scrub          ...
Social Recommender                                               filter                      Twitter                       ...
SocRec: architecture         Twitter                                             filter                                    ...
SocRec: results                           uid        recommend       weight                  carbonfiberxrm     ClosingBell...
City of Palo Alto open data                                                 Regex           Regex                         ...
CoPA: log events
CoPA: results                                      0.12                                                               Esti...
drill-down  blog, code/wiki/gists, jars, list, DevOps products:  cascading.org/  github.org/Cascading/  conjars.org/  goo....
Intro to Cascading (SpringOne2GX)
Presented at the SpringOne2GX conference in Washington, DC on 2012-10-17. Email me pnathan AT concurrentinc DOT com for a PDF version if you need it.

Published in: Technology

    1. 1. Intro to Cascading. Paco Nathan, Concurrent, Inc. pnathan@concurrentinc.com @pacoid Copyright @2012, Concurrent, Inc. [flow diagram labels: Document Collection, Scrub, Tokenize, token, HashJoin Left, Stop Word List (RHS), Regex token, GroupBy token, Count, Word Count]
    2. 2. Enterprise Apps for Big Data with Cascading 1. intro: Cascading API 2. backstory: Big Data origins 3. context: Hadoop cliff notes 4. theory: Data Science teams 5. tutorial: for the impatient 6. code: sample apps
    3. 3. Intro to Cascading 1. intro: Cascading API
    4. 4. Cascading API: purpose ‣ simplify data processing development and deployment ‣ improve application developer productivity ‣ enable data processing application manageability
    5. 5. Cascading API: a few facts ‣ Java open source project (ASL 2) using Git, Gradle, Maven, JUnit, etc. ‣ in production (~5 yrs) at hundreds of enterprise Hadoop deployments: Finance, Health Care, Transportation, other verticals ‣ studies published about large use cases: Twitter, Etsy, Airbnb, Square, Climate Corporation, FlightCaster, Williams-Sonoma ‣ partnerships and distribution with SpringSource, Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC ‣ several open source projects built atop, contribs by Twitter, Etsy, etc., which provide substantial Machine Learning libraries ‣ DSLs available in Scala, Clojure, Python (Jython), Ruby (JRuby), Groovy ‣ data “taps” integrate popular data frameworks via JDBC, Memcached, HBase, plus serialization in Apache Thrift, Avro, Kryo, etc. ‣ entire app compiles into a single JAR: fully connected for compiler optimization, exception handling, debugging, config, scheduling, etc.
    6. 6. Cascading API: a few quotes “Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset … Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.” CIO, Thor Olavsrud, 2012-06-06 cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading “Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics … A vast improvement over native MapReduce functions or Pig UDFs.” 2012 BOSSIE Awards, James Borck, 2012-09-18 infoworld.com/slideshow/65089 “Company’s promise to application developers is an opportunity to build and test applications on their desktops in the language of choice with familiar constructs and reusable components” Dr. Dobb’s, Adrian Bridgwater, 2012-06-08 drdobbs.com/jvm/where-does-big-data-go-to-get-data-inten/240001759
    7. 7. data+code “political spectrum” “Notes from the Mystery Machine Bus” by Steve Yegge, Google goo.gl/SeRZa
        “conservative” | “liberal”
        (mostly) Enterprise | (mostly) Start-Up
        risk management | customer experiments
        assurance | flexibility
        well-defined schema | schema follows code
        explicit configuration | convention
        type-checking compiler | interpreted scripts
        wants no surprises | wants no impediments
        Java, Scala, Clojure, etc. | PHP, Ruby, Python, etc.
        Cascading, Scalding, Cascalog, etc. | Hive, Pig, Hadoop Streaming, etc.
    8. 8. Cascading API: adoption As Enterprise apps move into Hadoop and related BigData frameworks, risk profiles shift toward more conservative programming practices Cascading provides a popular API for defining and managing Enterprise data workflows
    9. 9. enterprise data workflows Tuples, Pipelines, Endpoints, Operations, Joins, Assertions, Traps, etc. …in other words, “plumbing” [flow diagram: word count with stop word list]
    10. 10. data workflows: team ‣ Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL) ‣ Systems Integrator POV: system integration of heterogeneous data sources and compute platforms ‣ Data Scientist POV: a directed, acyclic graph (DAG) on which we can apply Amdahl's Law, etc. ‣ Data Architect POV: a physical plan for large-scale data flow management ‣ Software Architect POV: a pattern language, similar to plumbing or circuit design ‣ App Developer POV: API bindings for Java, Scala, Clojure, Jython, JRuby, etc. ‣ Systems Engineer POV: a JAR file, has passed CI, available in a Maven repo [flow diagram: word count with stop word list]
    11. 11. data workflows: layers
        business process: domain expertise, business trade-offs, operating parameters, market position, etc.
        API language: Java, Scala, Clojure, Jython, JRuby, Groovy, etc. …envision whatever runs in a JVM
        optimize / schedule: major changes in technology now
        physical plan: [flow diagram: word count with stop word list]
        compute substrate: Apache Hadoop, in-memory local mode; “assembler” code …envision GPUs, streaming, etc.
        machine data: Splunk, Nagios, Collectd, New Relic, etc.
    12. 12. data workflows: SQL Relational: SQL parser; logical plan, optimized based on stats; physical plan; query history, table stats; b-trees, etc.; ERD; table schema; catalog
    13. 13. data workflows: SQL vs. JVM
        Relational | Cascading + Driven
        SQL parser | SQL-92 compliant parser (in progress)
        logical plan, optimized based on stats | TODO: logical plan, optimized based on stats
        physical plan | API “plumbing”
        query history, table stats | app history, tuple stats
        b-trees, etc. | distributed compute substrate: Hadoop, in-memory, etc.
        ERD | flow diagram
        table schema | tuple schema
        catalog | endpoint usage DB
    14. 14. Intro to Cascading 2. backstory: Big Data origins
    15. 15. inflection point: huge Internet successes after the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG (timeline: 1997, 1998, 2004). consider this metric: annual revenue per customer / amount of data stored, which dropped 100x within a few years after 1997. storage and processing costs plummeted; now we must work much smarter to extract ROI from Big Data… our methods must adapt. “conventional wisdom” of RDBMS and BI tools became less viable; however, the business cadre was still focused on pivot tables and pie charts… which tends toward inertia! MapReduce and the Hadoop open source stack grew directly out of that contention… however, that effort only solves parts of the puzzle
    16. 16. inflection point: consequences Geoffrey Moore (Mohr Davidow Ventures, author of Crossing The Chasm) Hadoop Summit, 2012: “All of Fortune 500 is now on notice over the next 10-year period.” Amazon and Google as exemplars of massive disruption in retail, advertising, etc. data as the major force displacing Global 1000 over the next decade, mostly through apps: verticals, leveraging domain expertise. Michael Stonebraker (INGRES, PostgreSQL, Vertica, VoltDB, etc.) XLDB, 2012: “Complex analytics workloads are now displacing SQL as the basis for Enterprise apps.”
    17. 17. primary sources Amazon “Early Amazon: Splitting the website” – Greg Linden glinden.blogspot.com/2006/02/early-amazon-splitting-website.html eBay “The eBay Architecture” – Randy Shoup, Dan Pritchett addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf Inktomi (YHOO Search) “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff) youtube.com/watch?v=E91oEn1bnXM Google “The Birth of Google” – John Battelle wired.com/wired/archive/13.08/battelle.html “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff) youtube.com/watch?v=qsan-GQaeyk perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
    18. 18. the world before… BI, SQL, and highly optimized code
    19. 19. data innovation: circa 1996 Stakeholder Customers Excel pivot tables PowerPoint slide decks strategy BI Product Analysts requirements SQL Query optimized Engineering code Web App result sets transactions RDBMS
    20. 20. the world after… machine learning, leveraging log files
    21. 21. data innovation: circa 2001 Stakeholder Product Customers dashboards UX Engineering models servlets recommenders Algorithmic + Web Apps Modeling classifiers Middleware aggregation event SQL Query history result sets customer transactions Logs DW ETL RDBMS
    22. 22. the world ahead… what our customers are doing now
    23. 23. data innovation: circa 2013 Customers Data Apps business Domain process Workflow Prod Expert dashboard Web Apps, metrics History services Mobile, data etc. s/w science dev Data Planner Scientist social discovery optimized interactions + capacity transactions, Eng endpoints modeling content App Dev Data Access Patterns Hadoop, Log In-Memory etc. Events Data Grid Ops DW Ops batch "real time" Cluster Scheduler introduced existing capability SDLC RDBMS RDBMS
    24. 24. a key difference…
    25. 25. statistical thinking Process Variation Data Tools employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables this approach attempts to understand not just problems and solutions, but also the processes involved and their variances particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering… programmers typically don’t think this way… however, both systems engineers and data scientists must!
    26. 26. reference by Leo Breiman Statistical Modeling: The Two Cultures Statistical Science, 2001 bit.ly/eUTh9L also check out RStudio: rstudio.org/ rpubs.com/
    27. 27. Intro to Cascading 3. context: Hadoop cliff notes
    28. 28. MapReduce architecture ‣ name node + data nodes ‣ job tracker + task trackers ‣ submit queue ‣ task slots ‣ HDFS ‣ distributed cache Wikipedia Apache
    29. 29. MapReduce: how it works map(k1, v1) → list(k2, v2) reduce(k2, list(v2)) → list(k3, v3) the property of data independence among tasks allows for parallel processing … maybe, if the stars are all aligned :) MapReduce is mostly about fault tolerance, and how to leverage “commodity hardware” to replace “big iron” solutions… where “big iron” might apply to Oracle + NetApp, or perhaps an IBM zSeries mainframe… or something else that’s expensive, undoubtedly. bonus for math geeks: see any concerns about O(n) complexity, given Amdahl’s Law plus the functional definitions listed above? keep in mind that each phase cannot conclude and progress to the next phase until after each of its tasks has successfully completed.
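The two signatures above can be sketched in plain Java, without Hadoop, as a single-process word count. This is only an illustration of the map → shuffle → reduce shape under stated assumptions: the class name `MapReduceSketch`, the `run` driver, and the sample input are hypothetical and do not come from the slides or from any Hadoop API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process sketch of the MapReduce model (not Hadoop code).
public class MapReduceSketch {

    // map(k1, v1) -> list(k2, v2): emit (token, 1) for every token in one line
    static List<Map.Entry<String, Integer>> map(Long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // reduce(k2, list(v2)) -> list(k3, v3): sum the counts for one token
    static List<Map.Entry<String, Integer>> reduce(String token, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return Collections.singletonList(new AbstractMap.SimpleEntry<>(token, sum));
    }

    static Map<String, Integer> run(List<String> lines) {
        // map phase: each line is independent of the others, so this loop could run in parallel
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            mapped.addAll(map(offset, line));
            offset += line.length() + 1;
        }
        // shuffle: group values by key; this is the barrier between the two phases
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        // reduce phase: each key is independent of the others
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            for (Map.Entry<String, Integer> kv : reduce(e.getKey(), e.getValue())) {
                result.put(kv.getKey(), kv.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {a=2, is=2, it=1, rose=2}
        System.out.println(run(Arrays.asList("a rose is a rose", "is it")));
    }
}
```

The shuffle step is where the "each phase must finish first" constraint shows up: no reduce call can start until every map output has been grouped.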
    30. 30. a brief history… circa 1979 – Stanford, MIT, CMU, etc. set/list operations in LISP, Prolog, etc., for parallel processing www-formal.stanford.edu/jmc/history/lisp/lisp.htm circa 2004 – Google MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat labs.google.com/papers/mapreduce.html circa 2006 – Apache Hadoop, originating from the Nutch Project Doug Cutting research.yahoo.com/files/cutting.pdf circa 2008 – Yahoo web scale search indexing Hadoop Summit, HUG, etc. developer.yahoo.com/hadoop/ circa 2009 – Amazon AWS Elastic MapReduce Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc. aws.amazon.com/elasticmapreduce/
    31. 31. CAP theorem purpose: theoretical limits for data access patterns essence: ‣ consistency ‣ availability ‣ partition tolerance best case scenario: you may pick two … or spend billions struggling to obtain all three at scale (GOOG) translated: cost of doing business www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf julianbrowne.com/article/viewer/brewers-cap-theorem
    32. 32. data access patterns because the world is not made of data warehouses… a handful of common data access patterns are prevalent learn to recognize these for any given problem typically expressed in terms of trade-offs: ‣ speed & volume (latency and throughput) ‣ reads & writes (access and storage) ‣ consistency / availability / partition tolerance
    33. 33. access → frameworks → forfeits
        access pattern | framework | forfeits
        financial transactions | general ledger in RDBMS | CAx
        ad-hoc queries | RDS (hosted MySQL) | CAx
        reporting, dashboards | like Pentaho | CAx
        log rotation/persistence | like Riak | xxP
        search indexes | like Lucene/Solr | xAP
        static content, archives | S3 (durable storage) | xAP
        customer facts | like Redis, Membase | xAP
        distributed counters, locks, sets | like Redis | xAP*
        data objects CRUD | key/value – like, NoSQL on MySQL | CxP
        authoritative metadata | like Zookeeper | CxP
        data prep, modeling at scale | like Hadoop/Cascading + R | CxP
        graph analysis | like Hadoop + Redis + Gephi | CxP
        data marts | like Hadoop/HBase | CxP
    35. 35. parallel computation parallelism allows for horizontal scale-out, which creates business “levers” in cost/performance at scale. NB: MapReduce provides a compute framework which is part-parallel and part-serial… which tends to complicate app development. most hard problems in industry have portions which do not allow data independence, or which require iteration. current efforts in massively parallel algorithms research may help to parallelize problems and reduce iteration – estimates are 3-5 years out for industry use. GPUs and other hardware architecture advancements will likely make Hadoop unrecognizable 3-5 years out
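The "part-parallel and part-serial" point is what Amdahl's Law quantifies: if a fraction p of the work parallelizes across n workers, the speedup is 1 / ((1 - p) + p / n), so the serial remainder caps the gain no matter how many nodes are added. A minimal sketch, assuming a hypothetical class `AmdahlSketch` and an illustrative p = 0.9 (neither comes from the slides):

```java
// Hypothetical illustration of Amdahl's Law; the 0.9 parallel fraction is an assumed example value.
public class AmdahlSketch {

    // speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction and n the worker count
    static double speedup(double parallelFraction, int workers) {
        if (parallelFraction < 0.0 || parallelFraction > 1.0 || workers < 1) {
            throw new IllegalArgumentException("need 0 <= p <= 1 and workers >= 1");
        }
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }

    public static void main(String[] args) {
        double p = 0.9; // assumed: 90% of the job is data-independent
        for (int n : new int[] {1, 10, 100, 1000}) {
            System.out.printf("workers=%d speedup=%.2f%n", n, speedup(p, n));
        }
        // as n grows, the speedup approaches 1 / (1 - p), which is 10 for p = 0.9
    }
}
```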
    36. 36. reference by Tom White Hadoop: The Definitive Guide O’Reilly, 2009 amazon.com/dp/1449311520 see also: Cluster Computing and MapReduce Lectures code.google.com/edu/submissions/mapreduce-minilecture/listing.html
    37. 37. Intro to Cascading 4. theory: Data Science teams
    38. 38. core values Data Science teams develop actionable insights, building confidence for decisions. that work may influence a few decisions worth billions (e.g., M&A) or billions of small decisions (e.g., AdWords), probably somewhere in-between… solving for pattern, at scale. an interdisciplinary pursuit which requires teams, not sole players
    39. 39. most valuable skills approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log file analysis, etc. unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean up most valuable skills: ‣ learn to use programmable tools that prepare data ‣ learn to generate compelling data visualizations ‣ learn to estimate the confidence for reported results ‣ learn to automate work, making analysis repeatable the rest of the skills – modeling, D3 algorithms, etc. – those are secondary
    40. 40. the science in data science? in a nutshell, what we do… ‣ estimate probability ‣ calculate analytic variance ‣ manipulate order complexity ‣ make use of learning theory + collab with DevOps, Stakeholders + reduce our work to cron entries
    41. 41. team process = needs ‣ discovery: help people ask the right questions ‣ modeling: allow automation to place informed bets ‣ integration: deliver products at scale to customers ‣ apps: build smarts into product features ‣ systems: keep infrastructure running, cost-effective (image: Gephi)
    42. 42. team composition = roles ‣ Domain Expert: business process, stakeholder ‣ Data Scientist: data science (data prep, discovery, modeling, etc.) ‣ App Dev: software engineering, automation ‣ Ops: systems engineering, access (legend: introduced capability) [flow diagram: word count with stop word list]
    43. 43. matrix = needs × roles [matrix; columns: discovery, modeling, integration, apps, systems; rows: stakeholder, scientist, developer, ops]
    44. 44. matrix: example team [matrix; columns: discovery, modeling, integration, apps, systems; rows: stakeholder, scientist, developer, ops] summary: this team seems heavy on systems, may need more overlap between modeling and integration, particularly among team leads
    45. 45. typical hand-offs integrity availability discovery communications people vendor data sources Query data Query Hosts query BI & dashboards warehouse Hosts hosts reporting production cluster presentations decision support classifiers predictive analyze, customer analytics visualize business interactions recommenders stakeholders internal API, crons, etc. modeling engineers, automation analysts
    46. 46. use case: marketing funnel (image: Wikipedia) • must optimize a very large ad spend • different vendors report different metrics • seasonal variation distorts performance • some campaigns are much smaller than others • hard to predict ROI for incremental spend approach: • log aggregation, followed with cohort analysis • Bayesian point estimates compare different-sized ad tests • customer lifetime value quantifies ROI of new leads • time series analysis normalizes for seasonal variation • geolocation adjusts for regional cost/benefit • linear programming models estimate elasticity of demand
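One common way to read "Bayesian point estimates compare different-sized ad tests" is a Beta-Binomial posterior mean, which shrinks a small test's raw conversion rate toward a prior so that it can be compared with a large test. The slides do not say which estimator was used; the sketch below is one assumption, and the class name `AdTestEstimate`, the Beta(2, 98) prior, and the sample counts are all hypothetical.

```java
// Hypothetical sketch: Beta-Binomial posterior mean as a point estimate for a conversion rate.
// The prior Beta(2, 98) (prior mean 2%) and the counts below are assumed example values.
public class AdTestEstimate {

    // posterior mean of Beta(priorAlpha + conversions, priorBeta + trials - conversions)
    static double posteriorMean(long conversions, long trials, double priorAlpha, double priorBeta) {
        if (conversions < 0 || trials < conversions || priorAlpha <= 0 || priorBeta <= 0) {
            throw new IllegalArgumentException("invalid counts or prior");
        }
        return (conversions + priorAlpha) / (trials + priorAlpha + priorBeta);
    }

    public static void main(String[] args) {
        double alpha = 2.0;
        double beta = 98.0;
        // small test: 2 conversions in 20 impressions (raw rate 10%)
        double small = posteriorMean(2, 20, alpha, beta);
        // large test: 300 conversions in 10000 impressions (raw rate 3%)
        double large = posteriorMean(300, 10000, alpha, beta);
        System.out.printf("small test estimate: %.4f%n", small);
        System.out.printf("large test estimate: %.4f%n", large);
    }
}
```

The small test's estimate moves far from its raw 10% because 20 impressions carry little evidence, while the large test's estimate stays close to its raw 3%.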
    47. 47. use case: ecommerce fraud (image: stat.berkeley.edu) • sparse data means lots of missing values • “needle in a haystack” lack of training cases • answers are available in large-scale batch, results are needed in real-time event processing • not just one pattern to detect – many, ever-changing approach: • random forest (RF) classifiers predict likely fraud • subsampled data to re-balance training sets • impute missing values based on density functions • train on massive log files, run on in-memory grid • adjust metrics to minimize customer support costs • detect novelty – report anomalies via notifications
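The "subsampled data to re-balance training sets" step can be sketched as random downsampling of the majority class until it matches the rare class. This is one simple reading of that bullet, not the method used in the talk; the class name `RebalanceSketch`, the method `rebalance`, and the seed parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: downsample the majority class so both classes have equal size.
public class RebalanceSketch {

    // returns all minority examples plus an equally sized random sample of the majority
    static <T> List<T> rebalance(List<T> majority, List<T> minority, long seed) {
        List<T> shuffled = new ArrayList<>(majority);
        Collections.shuffle(shuffled, new Random(seed)); // seeded so the sample is repeatable
        int keep = Math.min(minority.size(), shuffled.size());
        List<T> balanced = new ArrayList<>(minority);
        balanced.addAll(shuffled.subList(0, keep));
        return balanced;
    }

    public static void main(String[] args) {
        List<String> legit = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            legit.add("legit-" + i);
        }
        List<String> fraud = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            fraud.add("fraud-" + i);
        }
        List<String> training = rebalance(legit, fraud, 42L);
        System.out.println("training set size: " + training.size()); // 5 fraud + 5 legit
    }
}
```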
    48. 48. use case: customer segmentation (image: Mathworks) • many millions of customers, hard to determine which features resonate • multi-modal distributions get obscured by the practice of calculating an “average” • not much is known about individual customers approach: • connected components for sessionization, determining uniques from logs • estimates for age, gender, income, geo, etc. • clustering algorithms to group into market segments • social graph infers “unknown” relationships • covariance/heat maps visualize segments vs. feature sets
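"Connected components … determining uniques from logs" can be illustrated with a union-find structure: identifiers seen together in one log event (say, a cookie and a login) are merged, and each resulting component counts as one unique customer. A minimal sketch under those assumptions; the class name `UniquesSketch`, the identifier format, and the sample events are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: union-find over identifiers that co-occur in log events.
public class UniquesSketch {
    private final Map<String, String> parent = new HashMap<>();

    // find the representative of an identifier, compressing the path on the way
    String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        while (!parent.get(id).equals(root)) {
            String next = parent.get(id);
            parent.put(id, root);
            id = next;
        }
        return root;
    }

    // record that two identifiers belong to the same customer
    void union(String a, String b) {
        String ra = find(a);
        String rb = find(b);
        if (!ra.equals(rb)) {
            parent.put(ra, rb);
        }
    }

    // number of connected components, i.e. estimated unique customers
    int countComponents() {
        Set<String> roots = new HashSet<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            roots.add(find(id));
        }
        return roots.size();
    }

    static int countUniques(String[][] events) {
        UniquesSketch uf = new UniquesSketch();
        for (String[] event : events) {
            uf.union(event[0], event[1]);
        }
        return uf.countComponents();
    }

    public static void main(String[] args) {
        String[][] events = {
            {"cookie:a", "user:1"},
            {"cookie:b", "user:1"}, // same login from a second browser
            {"cookie:c", "user:2"}
        };
        System.out.println(countUniques(events)); // prints 2
    }
}
```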
    49. 49. use case: monetizing content (image: Digital Humanities) • need to suggest relevant content which would otherwise get buried in the back catalog • big disconnect between inventory and limited performance ad market • enormous amounts of text, hard to categorize approach: • text analytics glean key phrases from documents • hierarchical clustering of char frequencies detects lang • latent Dirichlet allocation (LDA) reduces dimension to topic models • recommenders suggest similar topics to customers • collaborative filters connect known users with less known
    50. 50. reference by DJ Patil Data Jujitsu O’Reilly, 2012 amazon.com/dp/B008HMN5BE Building Data Science Teams O’Reilly, 2011 amazon.com/dp/B005O4U3ZE
    51. 51. Intro to Cascading 5. tutorial: for the impatient
    52. 52. “Cascading for the Impatient” cascading.org/category/impatient/ ‣ a series of introductory tutorials and code samples ‣ 1:1 code comparisons in Scalding, Cascalog, Pig, Hive [flow diagram: word count with stop word list]
    53. 53. 1: copy [diagram labels: Source, M, Sink] 1 mapper, 0 reducers, 10 lines code

        import java.util.Properties;

        import cascading.flow.FlowDef;
        import cascading.flow.hadoop.HadoopFlowConnector;
        import cascading.pipe.Pipe;
        import cascading.property.AppProps;
        import cascading.scheme.hadoop.TextDelimited;
        import cascading.tap.Tap;
        import cascading.tap.hadoop.Hfs;

        public class
          Main
          {
          public static void
          main( String[] args )
            {
            String inPath = args[ 0 ];
            String outPath = args[ 1 ];

            Properties props = new Properties();
            AppProps.setApplicationJarClass( props, Main.class );
            HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

            // create the source tap (tab-delimited text with a header line)
            Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

            // create the sink tap
            Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

            // specify a pipe to connect the taps
            Pipe copyPipe = new Pipe( "copy" );

            // connect the taps, pipes, etc., into a flow
            FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
             .addSource( copyPipe, inTap )
             .addTailSink( copyPipe, outTap );

            // run the flow
            flowConnector.connect( flowDef ).complete();
            }
          }
    54. 54. wait! ten lines of code for a file copy… seems like a lot.
    55. 55. same JAR, any scale…
        MegaCorp Enterprise IT: PBs of data; 1000+ node private cluster; EVP calls you when app fails; runtime: days+
        Production Cluster: TBs of data; EMR w/ 50 HPC Instances; Ops monitors results; runtime: hours – days
        Staging Cluster: GBs of data; EMR + 4 Spot Instances; CI shows red or green lights; runtime: minutes – hours
        Your Laptop: MBs of data; Hadoop standalone mode; passes unit tests, or not; runtime: seconds – minutes
    56. 56. 2: word count [flow diagram: Document Collection, Tokenize, GroupBy token, Count, Word Count] 1 mapper, 1 reducer; 18 lines code; gist.github.com/3900702
    57. 57. Cascading / Java [flow diagram: word count]
        String docPath = args[ 0 ];
        String wcPath = args[ 1 ];
        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
        // create source and sink taps
        Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
        Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
        // specify a regex to split "document" text lines into token stream
        Fields token = new Fields( "token" );
        Fields text = new Fields( "text" );
        RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
        // only returns "token"
        Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
        // determine the word counts
        Pipe wcPipe = new Pipe( "wc", docPipe );
        wcPipe = new GroupBy( wcPipe, token );
        wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
         .addSource( docPipe, docTap )
         .addTailSink( wcPipe, wcTap );
        // write a DOT file and run the flow
        Flow wcFlow = flowConnector.connect( flowDef );
        wcFlow.writeDOT( "dot/wc.dot" );
        wcFlow.complete();
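To make the Tokenize, GroupBy and Count steps concrete without a Hadoop cluster, here is a minimal plain-Java sketch of what the word-count flow computes. It is not Cascading API code: the class name `WordCountSketch` and the decision to drop empty tokens are assumptions of this sketch, and only the delimiter regex is taken from the slide.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java illustration of the word-count flow; hypothetical class, not part of Cascading.
class WordCountSketch {
  // Same character class as the slide's splitter: space, brackets, parentheses, comma, period.
  static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

  // Tokenize each line, group identical tokens, and count the size of each group.
  static Map<String, Long> wordCount(List<String> lines) {
    return lines.stream()
        .flatMap(line -> Arrays.stream(line.split(DELIMITERS)))  // Tokenize
        .filter(token -> !token.isEmpty())                       // assumption: skip empty splits
        .collect(Collectors.groupingBy(                          // GroupBy token + Count
            Function.identity(), TreeMap::new, Collectors.counting()));
  }

  public static void main(String[] args) {
    List<String> lines = List.of(
        "A rain shadow is a dry area on the lee back side of a mountainous area.",
        "A rain shadow is an area of dry land.");
    // Print token and count separated by a tab, like the flow's TSV sink.
    wordCount(lines).forEach((token, count) -> System.out.println(token + "\t" + count));
  }
}
```

The sketch is case-sensitive, so "A" and "a" are counted separately; that gap is what the later scrub step addresses.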
    58. 58. Scalding / Scala [flow diagram: word count]
        // Sujit Pal
        // sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
        package com.mycompany.impatient
        import com.twitter.scalding._
        class Part2(args : Args) extends Job(args) {
          val input = Tsv(args("input"), ('docId, 'text))
          val output = Tsv(args("output"))
          input.read.
            flatMap('text -> 'word) { text : String => text.split("""\s+""") }.
            groupBy('word) { group => group.size }.
            write(output)
        }
    59. 59. Cascalog / Clojure [flow diagram: word count]
        ; Paul Lam
        ; github.com/Quantisan/Impatient
        (ns impatient.core
          (:use [cascalog.api]
                [cascalog.more-taps :only (hfs-delimited)])
          (:require [clojure.string :as s]
                    [cascalog.ops :as c])
          (:gen-class))
        (defmapcatop split [line]
          "reads in a line of string and splits it by regex"
          (s/split line #"[\[\](),.)\s]+"))
        (defn -main [in out & args]
          (?<- (hfs-delimited out)
               [?word ?count]
               ((hfs-delimited in :skip-header? true) _ ?line)
               (split ?line :> ?word)
               (c/count ?count)))
    60. 60. Hive [flow diagram: word count]
        -- Steve Severance
        -- stackoverflow.com/questions/10039949/word-count-program-in-hive
        CREATE TABLE input (line STRING);
        LOAD DATA LOCAL INPATH 'input.tsv'
        OVERWRITE INTO TABLE input;
        -- split each line on spaces, then count rows per word
        SELECT word, COUNT(*)
        FROM input LATERAL VIEW explode(split(line, ' ')) lTable AS word
        GROUP BY word;
    61. 61. Pig [flow diagram: word count]
        -- kudos to Dmitriy Ryaboy
        docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource') AS (doc_id, text);
        docPipe = FILTER docPipe BY doc_id != 'doc_id';
        -- specify regex to split "document" text lines into token stream
        tokenPipe = FOREACH docPipe GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
        tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';
        -- determine the word counts
        tokenGroups = GROUP tokenPipe BY token;
        wcPipe = FOREACH tokenGroups GENERATE group AS token, COUNT(tokenPipe) AS count;
        -- output
        STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
        EXPLAIN -out dot/wc_pig.dot -dot wcPipe;
    62. 62. 3: wc + scrub [flow diagram: Document Collection, Scrub, Tokenize, GroupBy token, Count, Word Count] 1 mapper, 1 reducer; 22+10 lines code
    63. 63. 4: wc + scrub + stop words [flow diagram: Document Collection, Scrub, Tokenize, HashJoin Left with Stop Word List (RHS), Regex token, GroupBy token, Count, Word Count] 1 mapper, 1 reducer; 28+10 lines code
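The scrub and stop-word steps can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the tutorial's code: `StopWordSketch` is a hypothetical class, the scrub rule (trim plus lowercase) is assumed, and a `Set` lookup stands in for the left join against the stop word list followed by a filter that keeps only unmatched tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration of scrub + stop-word removal; hypothetical class, not Cascading code.
class StopWordSketch {
  // Scrub: normalise a token. Trim + lowercase is an assumption of this sketch.
  static String scrub(String token) {
    return token.trim().toLowerCase(Locale.ROOT);
  }

  // Keep only scrubbed tokens that have no match in the stop word list.
  // The Set lookup plays the role of the left join's right-hand side.
  static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
    List<String> kept = new ArrayList<>();
    for (String raw : tokens) {
      String token = scrub(raw);
      if (token.isEmpty()) {
        continue; // nothing left after scrubbing
      }
      boolean matchedStopWord = stopWords.contains(token);
      if (!matchedStopWord) {
        kept.add(token);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> tokens = List.of("A", " Rain", "shadow", "is", "a", "dry", "area");
    Set<String> stopWords = Set.of("a", "is"); // hypothetical stop list
    System.out.println(removeStopWords(tokens, stopWords));
  }
}
```

Because the stop list is small relative to the token stream, holding it in memory on every mapper is what makes a map-side join attractive here.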
    64. 64. 5: tf-idf [flow diagram: the stop-word-filtered token stream feeds Unique, Insert, SumBy, GroupBy, Count, HashJoin, CoGroup and ExprFunc steps that compute D, DF, TF and TF-IDF, alongside the Word Count branch] 11 mappers, 9 reducers; 65+10 lines code
    65. 65. 6: tf-idf + tdd [flow diagram: the tf-idf flow with Assert, Checkpoint and Failure Traps added] 12 mappers, 9 reducers; 76+14 lines code
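The idea behind an assertion paired with a failure trap can be sketched in plain Java: records that fail a check are diverted to a side output instead of aborting the whole run. Everything named here is hypothetical (`TrapSketch`, `route`, and the validation pattern are assumptions of this sketch, not the tutorial's assertion), and it does not use the Cascading API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java illustration of "assert, then trap the failures"; hypothetical class.
class TrapSketch {
  // Assumed validity rule: "doc" + digits, a tab, then non-empty text.
  static final Pattern VALID = Pattern.compile("doc\\d+\\t.+");

  // Holds the two outputs: records that continue downstream and records sent to the trap.
  static final class Result {
    final List<String> passed = new ArrayList<>();
    final List<String> trapped = new ArrayList<>();
  }

  // Route each record by the assertion instead of failing on the first bad one.
  static Result route(List<String> records) {
    Result result = new Result();
    for (String record : records) {
      if (VALID.matcher(record).matches()) {
        result.passed.add(record);
      } else {
        result.trapped.add(record);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Result result = route(List.of(
        "doc05\tTwo Women. Secrets. A Broken Land. [DVD Australia]",
        "zoink\tnull"));
    System.out.println("passed:  " + result.passed);
    System.out.println("trapped: " + result.trapped);
  }
}
```

Keeping the bad records in a separate output means they can be inspected after the run, which is the test-driven angle of this step.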
    66. 66. deployed on AWS… aws.amazon.com/elasticmapreduce/
        elastic-mapreduce --create --name "TF-IDF" \
          --jar s3n://temp.cascading.org/impatient/part6.jar \
          --arg s3n://temp.cascading.org/impatient/rain.txt \
          --arg s3n://temp.cascading.org/impatient/out/wc \
          --arg s3n://temp.cascading.org/impatient/en.stop \
          --arg s3n://temp.cascading.org/impatient/out/tfidf \
          --arg s3n://temp.cascading.org/impatient/out/trap \
          --arg s3n://temp.cascading.org/impatient/out/check
    67. 67. results? [flow diagram: tf-idf + tdd]
        input:
        doc_id  text
        doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
        doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
        doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
        doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
        doc05   Two Women. Secrets. A Broken Land. [DVD Australia]
        zoink   null
        output:
        doc_id  tf-idf  token
        doc02   0.9163  air
        doc05   0.9163  australia
        doc05   0.9163  broken
        doc04   0.9163  californias
        doc04   0.9163  cause
        doc02   0.9163  cloudcover
        doc04   0.9163  death
        doc04   0.9163  deserts
        doc03   0.9163  downwind
        …
        doc02   0.9163  sinking
        doc04   0.9163  such
        doc04   0.9163  valley
        doc05   0.9163  women
        doc03   0.5108  land
        doc05   0.5108  land
        doc01   0.5108  lee
        doc02   0.5108  lee
        doc03   0.5108  leeward
        doc04   0.5108  leeward
        doc01   0.4463  area
        doc02   0.2231  area
        doc03   0.2231  area
        doc01   0.2231  dry
        doc02   0.2231  dry
        doc03   0.2231  dry
        doc02   0.2231  mountain
        doc03   0.2231  mountain
        doc04   0.2231  mountain
        doc01   0.0000  rain
        doc02   0.0000  rain
        doc03   0.0000  rain
        doc04   0.0000  rain
        doc01   0.0000  shadow
        doc02   0.0000  shadow
        doc03   0.0000  shadow
        doc04   0.0000  shadow
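The slide lists the weights but not the formula. One form that reproduces every listed value with the five valid documents is tf-idf = tf × ln( N / (1 + df) ), where tf is the token's count in the document, df is the number of documents containing the token, and N is the number of documents. The sketch below is an inference from those numbers, not a quote from the tutorial code; `TfIdfSketch` and `tfIdf` are hypothetical names.

```java
import java.util.Locale;

// Plain-Java illustration of the weighting that matches the slide's numbers; hypothetical class.
class TfIdfSketch {
  // tf: occurrences of the token in one document
  // df: number of documents that contain the token
  // nDocs: total number of documents
  static double tfIdf(long tf, long df, long nDocs) {
    return tf * Math.log((double) nDocs / (1.0 + df));
  }

  // Format to four decimals, as on the slide.
  static String fmt(double value) {
    return String.format(Locale.ROOT, "%.4f", value);
  }

  public static void main(String[] args) {
    long nDocs = 5; // doc01..doc05; the malformed "zoink" row is excluded
    System.out.println("air (doc02):  tf=1 df=1 -> " + fmt(tfIdf(1, 1, nDocs)));
    System.out.println("lee (doc01):  tf=1 df=2 -> " + fmt(tfIdf(1, 2, nDocs)));
    System.out.println("area (doc01): tf=2 df=3 -> " + fmt(tfIdf(2, 3, nDocs)));
    System.out.println("rain (doc01): tf=1 df=4 -> " + fmt(tfIdf(1, 4, nDocs)));
  }
}
```

Under this form, "rain" and "shadow" score 0.0000 because they appear in four of the five documents, so ln(5/5) = 0, while "area" in doc01 scores double its value in doc02 and doc03 because it occurs twice there.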
    68. 68. comparisons? compare similar code in Scalding (Scala) and Cascalog (Clojure):
        ‣ Scalding: sujitpal.blogspot.com/2012/08/scalding-for-impatient.html based on: github.com/twitter/scalding/wiki
        ‣ Cascalog: github.com/Quantisan/Impatient based on: github.com/nathanmarz/cascalog/wiki
    69. 69. Intro to Cascading [workflow diagram] 6. code: sample apps
    70. 70. Social Recommender github.com/Cascading/SampleRecommender [flow diagram: Twitter tweets, filter stop words, calculate similarity, QA, threshold min/max, Neo4j, LDA, Redis]
        ‣ social recommender based on Twitter: suggest users who tweet about similar stocks
        ‣ instead of a cross-product (potential bottleneck) this runs in parallel on Hadoop
        ‣ uses a stop word list to remove common words, offensive phrases, etc.
        ‣ one tap measures token frequency: for QA, adjust stop words, improve filter, etc.
        ‣ adapted in Spring by Costin Leau
    71. 71. SocRec: architecture [architecture diagram]
        source: Twitter firehose, batch updates of tweets ( uid, tweet, t )
        steps: filter stop words, filter low-freq, calculate similarity, threshold min, max
        checkpoints: tokenized tweets, token frequency, similarity
        annotations: analysis + QA, curation, thresholds, similar users
        sinks: Neo4j: social graph; Redis: results (uid: uidx, rank); LDA: topic trending
    72. 72. SocRec: results
        uid              recommend        weight
        carbonfiberxrm   ClosingBellNews  0.1459
        carbonfiberxrm   DJFunkyGrrL      0.0870
        ClosingBellNews  DJFunkyGrrL      0.1491
        CloudStocks      DJFunkyGrrL      0.1206
        ElmoreNicole     DJFunkyGrrL      0.1798
        EsNeey           alexiolo_        0.8603
        ...
    73. 73. City of Palo Alto open data github.com/Cascading/CoPA/wiki [flow diagram: CoPA GIS export parsed with Regex filters into tree, road and park records; joined with Tree Metadata and Road Metadata; Geohash, CoGroup and GroupBy steps with Checkpoints and Failure Traps; GPS logs; outputs include tree_dist, shade and reco]
        ‣ GIS export for parks, roads, trees (unstructured / open data)
        ‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
        ‣ curated metadata, used to enrich the dataset
        ‣ could extend via mash-up with many available public data APIs
        Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
        “Find a shady spot on a summer day to walk near downtown and take a call…”
    74. 74. CoPA: log events
    75. 75. CoPA: results [density plot: Estimated Tree Height (meters); x-axis avg_height, y-axis density, shaded by count]
        ‣ addr: 115 HAWTHORNE AVE
        ‣ lat/lng: 37.446, -122.168
        ‣ geohash: 9q9jh0
        ‣ tree: 413 site 2
        ‣ species: Liquidambar styraciflua
        ‣ avg height 23 m
        ‣ road albedo: 0.12
        ‣ distance: 10 m
        ‣ a short walk from my train stop ✔
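The geohash in these results is what lets trees, roads and GPS log events be grouped by location with an ordinary key join. Below is a minimal plain-Java sketch of the standard geohash encoding (interleaved longitude/latitude bisection, base-32 alphabet). `GeohashSketch` is a hypothetical class and is not taken from the CoPA code. Note that the slide's coordinates are rounded to three decimals, so only a short prefix can be reproduced reliably from them; longer hashes computed from the rounded values need not match the slide's 9q9jh0.

```java
// Plain-Java illustration of standard geohash encoding; hypothetical class, not CoPA code.
class GeohashSketch {
  static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

  // Encode a latitude/longitude pair into a geohash of the given length.
  static String encode(double lat, double lng, int precision) {
    double latLo = -90, latHi = 90, lngLo = -180, lngHi = 180;
    StringBuilder hash = new StringBuilder();
    boolean useLng = true; // bits alternate, starting with longitude
    int bitCount = 0;
    int value = 0;
    while (hash.length() < precision) {
      if (useLng) {
        double mid = (lngLo + lngHi) / 2;
        if (lng >= mid) { value = (value << 1) | 1; lngLo = mid; }
        else            { value = value << 1;       lngHi = mid; }
      } else {
        double mid = (latLo + latHi) / 2;
        if (lat >= mid) { value = (value << 1) | 1; latLo = mid; }
        else            { value = value << 1;       latHi = mid; }
      }
      useLng = !useLng;
      if (++bitCount == 5) { // every 5 bits become one base-32 character
        hash.append(BASE32.charAt(value));
        bitCount = 0;
        value = 0;
      }
    }
    return hash.toString();
  }

  public static void main(String[] args) {
    // Rounded coordinates from the slide; a 4-character prefix covers a cell tens of km wide.
    System.out.println(encode(37.446, -122.168, 4));
  }
}
```

Nearby points share a common prefix, so truncating the hash gives a coarse spatial bucket that can serve as a join or grouping key.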
    76. 76. drill-down
        blog: cascading.org/
        code/wiki/gists: github.com/Cascading/
        jars: conjars.org/
        list: goo.gl/KQtUL
        DevOps products: concurrentinc.com/
        pnathan@concurrentinc.com @pacoid
