Hadoop Summit, Cascading

Paco Nathan

Hadoop Summit 2009, Cascading dev talk

Strata CA 2018-03-08 https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223 Although it has long been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but exceptions are referred to human experts, whose decisions help improve new iterations of the models.

Human-in-the-loop: a design pattern for managing teams that leverage ML

Paco Nathan

Strata Singapore 2017 session talk 2017-12-06 https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/65611

Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called active learning allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models. This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We’ll consider some of the technical aspects, including available open source projects, as well as management perspectives for how to apply HITL:

* When is HITL indicated vs. when isn’t it applicable?
* How do HITL approaches compare/contrast with more “typical” use of Big Data?
* What’s the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?

In particular, we’ll examine use cases at O’Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.

Human-in-a-loop: a design pattern for managing teams which leverage ML

Paco Nathan

Big Data Spain, 2017-11-16 https://www.bigdataspain.org/2017/talk/human-in-the-loop-a-design-pattern-for-managing-teams-which-leverage-ml

Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models. This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We'll consider some of the technical aspects -- including available open source projects -- as well as management perspectives for how to apply HITL:

* When is HITL indicated vs. when isn't it applicable?
* How do HITL approaches compare/contrast with more "typical" use of Big Data?
* What's the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?

In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.

Humans in a loop: Jupyter notebooks as a front-end for AI

Paco Nathan

JupyterCon NY 2017-08-24 https://www.safaribooksonline.com/library/view/jupytercon-2017-/9781491985311/video313210.html Paco Nathan reviews use cases where Jupyter provides a front-end to AI as the means for keeping "humans in the loop". This talk introduces *active learning* and the "human-in-the-loop" design pattern for managing how people and machines collaborate in AI workflows, including several case studies. The talk also explores how O'Reilly Media leverages AI in Media, and in particular some of our use cases for active learning such as disambiguation in content discovery. We're using Jupyter as a way to manage active learning ML pipelines, where the machines generally run automated until they hit an edge case and refer the judgement back to human experts. In turn, the experts train the ML pipelines purely through examples, not feature engineering, model parameters, etc. Jupyter notebooks serve as one part configuration file, one part data sample, one part structured log, one part data visualization tool. O'Reilly has released an open source project on GitHub called `nbtransom` which builds atop `nbformat` and `pandas` for our active learning use cases. This work anticipates upcoming work on collaborative documents in JupyterLab, based on Google Drive. In other words, the machines and people are collaborators on shared documents.
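The active learning loop described in this abstract can be illustrated with a minimal sketch. This is not the `nbtransom` API or O'Reilly's pipeline: the `Pipeline` class, the `triage` function, the keyword-voting "model", and the 0.8 confidence threshold are all hypothetical, chosen only to show how low-confidence cases get referred to an expert and how the expert's examples feed the next iteration.

```python
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    """Toy stand-in for an ML pipeline that learns only from labeled examples."""

    threshold: float = 0.8                        # hypothetical confidence cutoff
    examples: dict = field(default_factory=dict)  # keyword -> label

    def predict(self, text):
        """Return (label, confidence) by voting over known keywords."""
        votes = {}
        for word in text.lower().split():
            label = self.examples.get(word)
            if label is not None:
                votes[label] = votes.get(label, 0) + 1
        if not votes:
            return None, 0.0
        label = max(votes, key=votes.get)
        return label, votes[label] / sum(votes.values())

    def learn(self, text, label):
        """Record an expert-provided example; no feature engineering involved."""
        for word in text.lower().split():
            self.examples.setdefault(word, label)


def triage(pipeline, items, ask_expert):
    """Label items automatically; refer low-confidence cases to a human."""
    results = {}
    for text in items:
        label, confidence = pipeline.predict(text)
        if confidence < pipeline.threshold:
            label = ask_expert(text)     # human judgement on the edge case
            pipeline.learn(text, label)  # feeds the next iteration
        results[text] = label
    return results
```

Here `ask_expert` is any callable that returns a label for a piece of text; in a notebook-based workflow it might correspond to a person editing a cell, but that mapping is an assumption of this sketch.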

Humans in the loop: AI in open source and industry

Paco Nathan

Nike Tech Talk, Portland, 2017-08-10 https://niketechtalks-aug2017.splashthat.com/ O'Reilly Media gets to see the forefront of trends in artificial intelligence: what the leading teams are working on, which use cases are getting the most traction, previews of advances before they get announced on stage. Through conferences, publishing, and training programs, we've been assembling resources for anyone who wants to learn. An excellent recent example: Generative Adversarial Networks for Beginners, by Jon Bruner. This talk covers current trends in AI, industry use cases, and recent highlights from the AI Conf series presented by O'Reilly and Intel, plus related materials from Safari learning platform, Strata Data, Data Show, and the upcoming JupyterCon. Along with reporting, we're leveraging AI in Media. This talk dives into O'Reilly uses of deep learning -- combined with ontology, graph algorithms, probabilistic data structures, and even some evolutionary software -- to help editors and customers alike accomplish more of what they need to do. In particular, we'll show two open source projects in Python from O'Reilly's AI team: • pytextrank built atop spaCy, NetworkX, datasketch, providing graph algorithms for advanced NLP and text analytics  • nbtransom leveraging Project Jupyter for a human-in-the-loop design pattern approach to AI work: people and machines collaborating on content annotation

Computable Content

Paco Nathan

Lessons learned from 3 (going on 4) generations of Jupyter use cases at O'Reilly Media. In particular, about "Oriole" tutorials which combine video with Jupyter notebooks, Docker containers, backed by services managed on a cluster by Marathon, Mesos, Redis, and Nginx. https://conferences.oreilly.com/fluent/fl-ca/public/schedule/detail/62859 https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/62858

Computable Content: Lessons Learned

Paco Nathan

Strata UK 2017. Computable content leverages Jupyter notebooks to make learning materials more powerful by integrating compute engines, data sources, etc. O’Reilly Media extended this approach to create the new Oriole Online Tutorial medium, publishing notebooks from authors along with video timelines. (A free public tutorial, Regex Golf, by Peter Norvig demonstrates what’s possible with this technology integration.) Each user session launches a Docker container on a Mesos cluster for fully personalized compute environments. The UX is entirely browser based.

SF Python Meetup: TextRank in Python

Paco Nathan

See 2020 update: https://derwen.ai/s/h88s SF Python Meetup, 2017-02-08 https://www.meetup.com/sfpython/events/237153246/ PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than the results of simple keyword extraction. PyTextRank integrates use of `TextBlob` and `SpaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
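To make the graph-ranking idea concrete, here is a simplified sketch in plain Python. It is not PyTextRank's implementation: it works on raw tokens rather than parsed, part-of-speech-filtered text, it ranks single words rather than assembling keyphrases, and the function names, window size, and damping factor are illustrative assumptions.

```python
from collections import defaultdict


def build_graph(tokens, window=2):
    """Link each token to the tokens that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def rank(graph, damping=0.85, iterations=50):
    """Score nodes with PageRank by power iteration on an undirected graph."""
    scores = {node: 1.0 / len(graph) for node in graph}
    for _ in range(iterations):
        scores = {
            node: (1 - damping) / len(graph)
            + damping * sum(scores[nbr] / len(graph[nbr]) for nbr in graph[node])
            for node in graph
        }
    # Highest-scoring words first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `rank(build_graph(tokens))` on a tokenized text returns `(word, score)` pairs; words that co-occur with many other well-connected words rise to the top.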

Use of standards and related issues in predictive analytics

Paco Nathan

Data Science in 2016: Moving Up

Paco Nathan

Data Science Reinvents Learning?

Paco Nathan

Presented 2015-08-24 at SF Bay ACM, held at the eBay south campus in San Jose. http://meetup.com/SF-Bay-ACM/events/221693508/ Project Jupyter https://jupyter.org/ evolved from IPython notebooks, and now supports a wide variety of programming language back-ends. Notebooks have proven to be effective tools used in Data Science, providing convenient packages for what Don Knuth coined as "literate programming" in the 1980s: code plus exposition in markdown. Results of running the code appear in-line as interactive graphics -- all packaged as collaborative, web-based documents. Some have said that the introduction of cloud-based notebooks is nearly as large of a fundamental change in software practice as the introduction of spreadsheets. O'Reilly Media has been considering the question, "What comes after books and video?" Or, as one might imagine more pointedly, what comes after Kindle? To that point we have collaborated with Project Jupyter to integrate notebooks into our content management process, allowing authors to generate articles, tutorials, reports, and other media products as notebooks that also incorporate video segments. Code dependencies are containerized using Docker, and all of the content gets managed in Git repositories. We have added another layer, an open source project called Thebe that provides a kind of "media player" for embedding the containerized notebooks into web pages.

Jupyter for Education: Beyond Gutenberg and Erasmus

Paco Nathan

GalvanizeU Seattle: Eleven Almost-Truisms About Data

Paco Nathan

http://www.meetup.com/Seattle-Data-Science/events/223445403/ Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study the technologies, the processes, and the people involved.

Microservices, containers, and machine learning

Paco Nathan

http://www.oscon.com/open-source-2015/public/schedule/detail/41579

In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.

Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) get containerized and used to crawl and parse email archives. These produce JSON data sets, then we run machine learning on a Spark cluster to find insights such as:

* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?

This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data, based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.

GraphX: Graph analytics for insights about developer communities

Paco Nathan

Graph Analytics in Spark

Paco Nathan

https://www.eventbrite.com/e/talk-by-paco-nathan-graph-analytics-in-spark-tickets-17173189472 Big Brains meetup hosted by BloomReach, 2015-06-04 Case study / demo of a large-scale graph analytics project, leveraging GraphX in Apache Spark to surface insights about open source developer communities — based on data mining of their email forums. The project works with any Apache email archive, applying NLP and machine learning techniques to analyze message threads, then constructs a large graph. Graph analytics, based on concise Scala coding examples in Spark, surface themes and interactions within the community. Results are used as feedback for respective developer communities, such as leaderboards, etc. As an example, we will examine analysis of the Spark developer community itself.

Apache Spark and the Emerging Technology Landscape for Big Data

Paco Nathan

QCon São Paulo: Real-Time Analytics with Spark Streaming

Paco Nathan

"Real-Time Analytics with Spark Streaming" presented at QCon São Paulo, 2015-03-26 http://qconsp.com/presentation/real-time-analytics-spark-streaming This talk presents an overview of Spark and its history and applications, then focuses on the Spark Streaming component used for real-time analytics. We compare it with earlier frameworks such as MillWheel and Storm, and explore industry motivations for open-source micro-batch streaming at scale. The talk will include demos for streaming apps that include machine-learning examples. We also consider public case studies of production deployments at scale. We’ll review the use of open-source sketch algorithms and probabilistic data structures that get leveraged in streaming – for example, the trade-off of 4% error bounds on real-time metrics for two orders of magnitude reduction in required memory footprint of a Spark app.

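The abstract above mentions trading bounded error for a much smaller memory footprint, without naming a specific structure. As one illustration of that kind of trade-off, here is a minimal Count-Min sketch in plain Python. The choice of Count-Min, the class name, and the `width` and `depth` defaults are assumptions of this sketch; they are not taken from the talk and are not tuned to the 4% figure it quotes.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counts in fixed memory.

    Estimates may be too high (hash collisions) but are never too low.
    """

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        """Yield one (row, column) cell per row, using a row-salted hash."""
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        """Increment the counters for `item` in every row."""
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        """Return the smallest counter seen for `item` across rows."""
        return min(self.table[row][col] for row, col in self._cells(item))
```

Memory use is fixed at `width * depth` counters no matter how many distinct items stream through, which is the property that makes this family of structures attractive for streaming metrics.
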
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More

Paco Nathan

A New Year in Data Science: ML Unpaused

Paco Nathan
