This document discusses Pattern, an open source project for migrating predictive models onto Apache Hadoop. It provides enterprise data workflow context, sample code, and a roadmap for migrating Predictive Model Markup Language (PMML) models to run at scale on Hadoop clusters. It also discusses how Pattern customers have experimented with using it for their predictive analytics needs.
Enterprise Data Workflows with Cascading - Paco Nathan
Cascading meetup held jointly with Enterprise Big Data meetup at Tata Consultancy Services in Santa Clara on 2012-12-17
http://www.meetup.com/cascading/events/94079162/
A Data Scientist And A Log File Walk Into A Bar... - Paco Nathan
Presented at Splunk .conf 2012 in Las Vegas. Includes an overview of the Cascading app based on City of Palo Alto open data. PS: email me if you need a different format than Keynote: @pacoid or pnathan AT concurrentinc DOT com
The Other Way of Doing Big Data: Declarative, Decoupled, Federated, Simple, and Resilient.
Also known as: How to Win at Scale and Influence People. Originally presented by Flip Kromer to the Research Board (http://www.researchboard.com/), June 2012
HP Microsoft SQL Server Data Management Solutions - Eduardo Castro
This presentation was used in an MSDN WebCast; we cover the hardware offerings for running a SQL Server data warehouse, including some detail about HP hardware.
Best Regards,
Ing. Eduardo Castro Martinez
http://ecastrom.blogspot.com
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop - Cloudera, Inc.
If you are interested in Hadoop and its capabilities, but you are not sure where to begin, this is the session for you. Learn the basics of Hadoop, see how to spin up a development cluster in the cloud or on-premises, and start exploring ETL processing with SQL and other familiar tools.
Description of the work done for the Semantic Markup activity of the Semantic Sensor Networks Incubator activity (at W3C).
Presentation made at the Australian Ontology Workshop, Melbourne, December 2009. The full title of the paper is: "Review of semantic enablement techniques used in geospatial and semantic standards for legacy and opportunistic mashups" (and it is available via crpit.com)
The HDFS NameNode is a robust and reliable service as seen in practice in production at Yahoo and other customers. However, the NameNode does not have automatic failover support. A hot failover solution called HA NameNode is currently under active development (HDFS-1623). This talk will cover the architecture, design and setup. We will also discuss the future direction for HA NameNode.
Overview of Social Media Use in Local Governments - Emilie Marquois
A presentation of Web 2.0, "territories 2.0", and some examples of social media use in local governments. Prepared for the Assises Nationales des TIC 2009 (Marseille).
Open Data: From the Information Age to the Action Age (PDF with notes) - Tim O'Reilly
This is the presentation I made at the UK Department for International Aid/Omidyar Network OpenUp! conference in London on November 13, 2012. I talk about open government not as a platform for transparency or citizen engagement, but for a developer ecosystem building useful services. A video of this talk is available at http://www.youtube.com/watch?feature=player_embedded&v=OIlxdpfu71o
Some Lessons for Startups (pdf with notes) - Tim O'Reilly
My talk at the Stanford Technology Ventures Program on March 6, 2013. I talk about some technical and business lessons from Square, Uber, AirBnB, and the Google Autonomous Vehicle that are applicable to today's startups.
OSCON 2012 US Patriot Act Implications for Cloud Computing - Diane Mueller, A... - OSCON Byrum
Presented by Diane Mueller, ActiveState @pythondj
Are you unsure what the security and privacy implications are for sensitive corporate data? The US Patriot Act is causing many of us to hesitate on leveraging the cloud.
Organizations are thinking long and hard about the legal and regulatory implications of cloud computing. When it comes to actual corporate data, no matter what the efficiency gains are, legal departments are often directing IT departments to steer clear of any service that eliminates their ability to keep potential sensitive information out of the hands of Federal prosecutors.
Despite all the hype about every application moving into the cloud, some practical patterns are starting to emerge in the types of data corporations are willing to move to the cloud.
Covered in this session:
(a) Introduction to the US Patriot Act and data privacy issues; implications for cloud computing; jurisdictional issues
(b) Best practices and practical patterns; classes of applications that best leverage the cloud
(c) What types of applications should stay on-premises; private cloud model(s); building a compliant cloud strategy
For more information:
email me at dianem {at} activestate {period} com
or ping me on twitter at @pythondj
visit http://activestate.com/stackato
The "bilan mobilité" (mobility assessment) allows the beneficiary to take stock of their skills and to acquire job-search techniques suited to their career plan.
It is a free and voluntary step taken by the beneficiary. It may be proposed by the company in the context of an internal transfer or a departure from the company.
BodyTrack: Open Source Tools for Health Empowerment through Self-Tracking - OSCON Byrum
The BodyTrack project develops open source self-tracking tools to aggregate and visualize data from diverse sources such as wearable sensors, observations from mobile apps, photos, and environmental data. Our goal is to empower individuals to explore potential environment/health interactions (food sensitivities, asthma or migraine triggers, sleep problems, etc.) and better assess strategies they think might help.
Have you ever witnessed the day-to-day activities that happen at a traffic signal and how the lives of the people around it are affected? You may have caught a glimpse of it while stopping your vehicle at a traffic signal or walking past on the pavement, but the reality belongs only to those who live their lives at these signals: the innumerable vendors, beggars, eunuchs, lepers, street kids, drug addicts, prostitutes, etc., for whom an entire world is restricted to these signals.
'Traffic Signal' tells the tale of some 60-odd characters whose world is centered on this place. Each of them has their own life, and how they make it happen on the road is what the film is all about.
Banner: Perfect Picture Company
Cast: Neetu Chandra, Konkona Sen Sharma, Ranveer Shourey, Sudhir Mishra
Direction: Madhur Bhandarkar
Source: IndiaGlitz
Solving the Wanamaker Problem for Healthcare (keynote file) - Tim O'Reilly
Finding a solution to Wanamaker's complaint, "Half of my advertising doesn't work, I just don't know which half," fueled the consumer internet revolution. We are now in the process of finding and solving a similar dilemma in healthcare. I offer some lessons from Silicon Valley for healthcare.
Localized methods for diffusions in large graphs - David Gleich
I describe a few ongoing research projects on diffusions in large graphs, and how we can design matrix computations to evaluate those diffusions efficiently.
An invited talk by Paco Nathan in the speaker series at the University of Chicago's Data Science for Social Good fellowship (2013-08-12) http://dssg.io/2013/05/21/the-fellowship-and-the-fellows.html
Learnings generalized from trends in Data Science:
a 30-year retrospective on Machine Learning,
a 10-year summary of Leading Data Science Teams,
and a 2-year survey of Enterprise Use Cases.
http://www.eventbrite.com/event/7476758185
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ... - Holden Karau
Beyond Shuffling - Tips & Tricks for scaling your Apache Spark programs. This talk walks through a number of common mistakes which can keep our Spark programs from scaling and examines the solutions, as well as general techniques useful for moving beyond a proof of concept to production. It covers topics like effective RDD re-use, considerations for working with key/value data, and finishes up with an introduction to one of Spark's newest features: Datasets.
Functional programming for optimization problems in Big DataPaco Nathan
Enterprise Data Workflows with Cascading.
Silicon Valley Cloud Computing Meetup talk at Cloud Tech IV, 4/20 2013
http://www.meetup.com/cloudcomputing/events/111082032/
R and Hadoop are changing the way organizations manage and utilize big data. Think Big Analytics and Revolution Analytics are helping clients plan, build, test and implement innovative solutions based on the two technologies that allow clients to analyze data in new ways; exposing new insights for the business. Join us as Jeffrey Breen explains the core technology concepts and illustrates how to utilize R and Revolution Analytics’ RevoR in Hadoop environments.
10 big data analytics tools to watch out for in 2019 - JanBask Training
The long-standing leader in the field of Big Data processing, known for its capacity to handle data at enormous scale.
https://www.janbasktraining.com/hadoop-big-data-analytics
Agile has democratized software architecture, taking it out of the hands of the few and putting it into the hands of the many. But architecture is a complex thing, and there are lots of mines in the meadow. This presentation provides some key things to keep in mind as you contribute to the evolution of your Rails application.
OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data - Paco Nathan
OSCON 2013 talk in Portland about https://github.com/Cascading/CoPA project for CMU, to build a recommender system based on Open Data from City of Palo Alto. This talk examines a "lengthy" (400+ lines) Cascalog app -- which is big for Cascalog, as well as issues involved in commercial use cases for Open Data.
Using Cascalog to build an app with City of Palo Alto Open Data - OSCON Byrum
"Using Cascalog to build an app with City of Palo Alto Open Data" by Paco Nathan, presented at OSCON 2013 in Portland. Based on a case study from "Enterprise Data Workflows with Cascading" http://shop.oreilly.com/product/0636920028536.do
Bringing OLAP Fully Online: Analyze Changing Datasets in MemSQL and Spark wi... - SingleStore
As the world moves from batch to online data processing, real-time data pipelines will supersede siloed data warehouse and transaction processing systems as core infrastructure.
While many analytics solutions tout query execution speed, this is only half of the equation.
For real time workloads, stale data renders query speed irrelevant when results and insights are out of date.
Beyond just “online queries,” real-time enterprises need “online datasets” that continuously update and make data accessible across the organization.
This session will cover approaches to building real-time pipelines with MemSQL, Hadoop, and Spark. Topics will include:
Key industry trends and the move to real-time data pipelines
How MemSQL customer Novus built the premier financial portfolio management platform using MemSQL as a real-time data store and query engine.
Operationalizing Spark for Advanced Analytics
Demonstration of how Pinterest is using the MemSQL Spark Connector to derive real-time insights on interesting and meaningful user activity with MemSQL and Spark.
Introduction to the MemSQL Spark Connector
Strategies for integrating Spark and Hadoop with real-time systems for transaction processing and operational analytics.
Presenters include MemSQL CEO Eric Frenkiel, Novus CTO Robert Stepeck, and Pinterest Software Engineer Yu Yang.
In a world of web portals and push notifications, users have developed demanding expectations for a real-time experience. Continuous updates, a responsive interface, and short loading times have become the norm. Most business analysts and data scientists, whose workflows remain bound by legacy tools and complex data pipelines, lack this fast, simple user experience.
From a business perspective, latency and complexity impede revenue by preventing access to the right data at the right time. Businesses that recognize the value of access to real-time data now have options to meet stringent objectives. They understand that serving “always up to date” data for analysis requires converging transactions and analytics in a real-time system. This session will highlight these architectures and customer achievements.
Human in the loop: a design pattern for managing teams working with ML - Paco Nathan
Strata CA 2018-03-08
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223
Although it has long been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but exceptions are referred to human experts, whose decisions help improve new iterations of the models.
Human-in-the-loop: a design pattern for managing teams that leverage ML - Paco Nathan
Strata Singapore 2017 session talk 2017-12-06
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/65611
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called active learning allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We’ll consider some of the technical aspects — including available open source projects — as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn’t it applicable?
* How do HITL approaches compare/contrast with more “typical” use of Big Data?
* What’s the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time:
* In what ways do the humans involved learn from the machines?
* In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
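The active-learning loop described above can be sketched in a few lines of plain Python. Everything here is a hypothetical stand-in: the scoring function, the confidence threshold, and the `human_expert` routing are placeholders for a real trained model and a real annotation queue.

```python
def model_score(text):
    """Toy stand-in for a trained classifier: returns (label, confidence).
    A real pipeline would call an ML model here."""
    positive = {"great", "good", "excellent"}
    negative = {"bad", "poor", "awful"}
    words = set(text.lower().split())
    pos, neg = len(words & positive), len(words & negative)
    if pos == neg:
        return ("unknown", 0.5)          # ambiguous case: low confidence
    return ("positive" if pos > neg else "negative", 0.9)

def human_expert(text):
    """Placeholder for routing an item to a human annotator's queue."""
    return "positive" if "great" in text.lower() else "negative"

def hitl_label(items, threshold=0.8):
    """Label items automatically; refer low-confidence exceptions to a human.
    The referred (text, label) pairs become training data for the next model."""
    labels, referred = {}, []
    for text in items:
        label, conf = model_score(text)
        if conf < threshold:
            label = human_expert(text)   # exception: human judgement wins
            referred.append((text, label))
        labels[text] = label
    return labels, referred

labels, referred = hitl_label(["great talk", "awful demo", "great but poor audio"])
```

The `referred` list is the essential output: those human judgements feed the next training iteration, which is what makes the loop "active" learning rather than one-shot labeling.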
Human-in-a-loop: a design pattern for managing teams which leverage ML - Paco Nathan
Big Data Spain, 2017-11-16
https://www.bigdataspain.org/2017/talk/human-in-the-loop-a-design-pattern-for-managing-teams-which-leverage-ml
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We'll consider some of the technical aspects -- including available open source projects -- as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn't it applicable?
* How do HITL approaches compare/contrast with more "typical" use of Big Data?
* What's the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?
In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Humans in a loop: Jupyter notebooks as a front-end for AI - Paco Nathan
JupyterCon NY 2017-08-24
https://www.safaribooksonline.com/library/view/jupytercon-2017-/9781491985311/video313210.html
Paco Nathan reviews use cases where Jupyter provides a front-end to AI as the means for keeping "humans in the loop". This talk introduces *active learning* and the "human-in-the-loop" design pattern for managing how people and machines collaborate in AI workflows, including several case studies.
The talk also explores how O'Reilly Media leverages AI in media, in particular some of our use cases for active learning such as disambiguation in content discovery. We're using Jupyter as a way to manage active learning ML pipelines, where the machines generally run automated until they hit an edge case and refer the judgement back to human experts. In turn, the experts train the ML pipelines purely through examples, not feature engineering, model parameters, etc.
Jupyter notebooks serve as one part configuration file, one part data sample, one part structured log, one part data visualization tool. O'Reilly has released an open source project on GitHub called `nbtransom` which builds atop `nbformat` and `pandas` for our active learning use cases.
This work anticipates upcoming work on collaborative documents in JupyterLab, based on Google Drive. In other words, where the machines and people are collaborators on shared documents.
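One way to picture "notebook as shared document" is that an .ipynb file is just JSON, so per-cell metadata can carry annotations written by both machines and people. Below is a minimal hand-rolled sketch of that idea; the `annotation` key and its fields are hypothetical inventions for illustration, and real code (like `nbtransom`) would use the `nbformat` library rather than building the dict by hand.

```python
import json

# Minimal nbformat-4-shaped notebook built by hand, for illustration only.
nb = {
    "nbformat": 4, "nbformat_minor": 5, "metadata": {}, "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": "Sample paragraph to be categorized."},
    ],
}

cell = nb["cells"][0]
# A machine annotator writes its guess into the cell metadata...
cell["metadata"]["annotation"] = {"label": "overview", "by": "model", "conf": 0.62}
# ...and a human expert overrides low-confidence guesses in the same document.
if cell["metadata"]["annotation"]["conf"] < 0.8:
    cell["metadata"]["annotation"].update({"label": "tutorial", "by": "human"})

doc = json.dumps(nb, indent=1)   # what would be written to disk as .ipynb
```

Because both parties write into the same file, the notebook doubles as the structured log and the data sample described above.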
Humans in the loop: AI in open source and industry - Paco Nathan
Nike Tech Talk, Portland, 2017-08-10
https://niketechtalks-aug2017.splashthat.com/
O'Reilly Media gets to see the forefront of trends in artificial intelligence: what the leading teams are working on, which use cases are getting the most traction, previews of advances before they get announced on stage. Through conferences, publishing, and training programs, we've been assembling resources for anyone who wants to learn. An excellent recent example: Generative Adversarial Networks for Beginners, by Jon Bruner.
This talk covers current trends in AI, industry use cases, and recent highlights from the AI Conf series presented by O'Reilly and Intel, plus related materials from Safari learning platform, Strata Data, Data Show, and the upcoming JupyterCon.
Along with reporting, we're leveraging AI in Media. This talk dives into O'Reilly uses of deep learning -- combined with ontology, graph algorithms, probabilistic data structures, and even some evolutionary software -- to help editors and customers alike accomplish more of what they need to do.
In particular, we'll show two open source projects in Python from O'Reilly's AI team:
• pytextrank built atop spaCy, NetworkX, datasketch, providing graph algorithms for advanced NLP and text analytics
• nbtransom leveraging Project Jupyter for a human-in-the-loop design pattern approach to AI work: people and machines collaborating on content annotation
Lessons learned from 3 (going on 4) generations of Jupyter use cases at O'Reilly Media. In particular, about "Oriole" tutorials which combine video with Jupyter notebooks, Docker containers, backed by services managed on a cluster by Marathon, Mesos, Redis, and Nginx.
https://conferences.oreilly.com/fluent/fl-ca/public/schedule/detail/62859
https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/62858
Strata UK 2017. Computable content leverages Jupyter notebooks to make learning materials more powerful by integrating compute engines, data sources, etc. O’Reilly Media extended this approach to create the new Oriole Online Tutorial medium, publishing notebooks from authors along with video timelines. (A free public tutorial, Regex Golf, by Peter Norvig demonstrates what’s possible with this technology integration.) Each user session launches a Docker container on a Mesos cluster for fully personalized compute environments. The UX is entirely browser based.
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates use of `TextBlob` and `SpaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
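The gist of the TextRank algorithm can be sketched with no dependencies: build a word co-occurrence graph over a sliding window, then run PageRank on it. The tokenizer and stopword list below are crude stand-ins for the `TextBlob`/`SpaCy` analysis that PyTextRank actually uses, and this ranks single words rather than full keyphrases.

```python
from collections import defaultdict

def textrank_keywords(text, window=2, damping=0.85, iters=30):
    """Rank words by PageRank over a co-occurrence graph (TextRank's core idea)."""
    stop = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    words = [w for w in words if w and w not in stop]
    # undirected edges between words that co-occur within the window
    neighbors = defaultdict(set)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[i] != words[j]:
                neighbors[words[i]].add(words[j])
                neighbors[words[j]].add(words[i])
    # plain power-iteration PageRank over the co-occurrence graph
    rank = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        rank = {w: (1 - damping) + damping *
                sum(rank[v] / len(neighbors[v]) for v in neighbors[w])
                for w in rank}
    return sorted(rank, key=rank.get, reverse=True)

top = textrank_keywords(
    "graph algorithms rank keyphrases in texts; "
    "graph centrality makes keyphrases stand out")
```

Words that co-occur with many highly ranked words rise to the top, which is why the approach surfaces keyphrases rather than merely frequent terms.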
Use of standards and related issues in predictive analytics - Paco Nathan
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
Presented 2015-08-24 at SF Bay ACM, held at the eBay south campus in San Jose.
http://meetup.com/SF-Bay-ACM/events/221693508/
Project Jupyter https://jupyter.org/ evolved from IPython notebooks, and now supports a wide variety of programming language back-ends. Notebooks have proven to be effective tools used in Data Science, providing convenient packages for what Don Knuth coined as "literate programming" in the 1980s: code plus exposition in markdown. Results of running the code appear in-line as interactive graphics -- all packaged as collaborative, web-based documents. Some have said that the introduction of cloud-based notebooks is nearly as large of a fundamental change in software practice as the introduction of spreadsheets.
O'Reilly Media has been considering the question, "What comes after books and video?" Or, as one might imagine more pointedly, what comes after Kindle? To that point we have collaborated with Project Jupyter to integrate notebooks into our content management process, allowing authors to generate articles, tutorials, reports, and other media products as notebooks that also incorporate video segments. Code dependencies are containerized using Docker, and all of the content gets managed in Git repositories. We have added another layer, an open source project called Thebe, that provides a kind of "media player" for embedding the containerized notebooks into web pages.
GalvanizeU Seattle: Eleven Almost-Truisms About Data - Paco Nathan
http://www.meetup.com/Seattle-Data-Science/events/223445403/
Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study about the technologies, the processes, and the people involved.
Microservices, containers, and machine learning - Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. A natural language processing service in Python (based on NLTK, TextBlob, WordNet, etc.) gets containerized and used to crawl and parse email archives. These produce JSON data sets, then we run machine learning on a Spark cluster to find out insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data — based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
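As a toy illustration of the third question above, "who discusses most frequently with whom" reduces to counting reply edges between senders once the archive has been parsed to JSON. The records and field names below are hypothetical, not Exsto's actual schema.

```python
import json
from collections import Counter

# Hand-made records shaped like what an email-archive crawler might emit.
raw = """[
  {"id": "m1", "sender": "alice", "in_reply_to": null},
  {"id": "m2", "sender": "bob",   "in_reply_to": "m1"},
  {"id": "m3", "sender": "alice", "in_reply_to": "m2"},
  {"id": "m4", "sender": "carol", "in_reply_to": "m1"}
]"""

messages = json.loads(raw)
by_id = {m["id"]: m for m in messages}

# Count each reply as an undirected interaction between two senders.
pairs = Counter()
for m in messages:
    parent = by_id.get(m["in_reply_to"])
    if parent and parent["sender"] != m["sender"]:
        pairs[frozenset((m["sender"], parent["sender"]))] += 1

top_pair, count = pairs.most_common(1)[0]   # the pair who discuss most
```

At scale, the same pair counts become weighted edges in a GraphX graph, where centrality measures surface the community leaders per topic.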
https://www.eventbrite.com/e/talk-by-paco-nathan-graph-analytics-in-spark-tickets-17173189472
Big Brains meetup hosted by BloomReach, 2015-06-04
Case study / demo of a large-scale graph analytics project, leveraging GraphX in Apache Spark to surface insights about open source developer communities — based on data mining of their email forums. The project works with any Apache email archive, applying NLP and machine learning techniques to analyze message threads, then constructs a large graph. Graph analytics, based on concise Scala coding examples in Spark, surface themes and interactions within the community. Results are used as feedback for respective developer communities, such as leaderboards, etc. As an example, we will examine analysis of the Spark developer community itself.
QCon São Paulo: Real-Time Analytics with Spark Streaming - Paco Nathan
"Real-Time Analytics with Spark Streaming" presented at QCon São Paulo, 2015-03-26
http://qconsp.com/presentation/real-time-analytics-spark-streaming
This talk presents an overview of Spark and its history and applications, then focuses on the Spark Streaming component used for real-time analytics. We compare it with earlier frameworks such as MillWheel and Storm, and explore industry motivations for open-source micro-batch streaming at scale.
The talk will include demos for streaming apps that include machine-learning examples. We also consider public case studies of production deployments at scale.
We’ll review the use of open-source sketch algorithms and probabilistic data structures that get leveraged in streaming – for example, the trade-off of 4% error bounds on real-time metrics for two orders of magnitude reduction in required memory footprint of a Spark app.
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Pattern: an open source project for migrating predictive models onto Apache Hadoop
1. “Pattern –
an open source project for migrating
predictive models onto Apache Hadoop”
Paco Nathan
Concurrent, Inc.
San Francisco, CA
@pacoid
Copyright @2013, Concurrent, Inc.
Sunday, 17 March 13 1
2. Pattern: predictive models at scale

[flow diagram: Document Collection → Scrub token / Tokenize (M) → HashJoin Left with Stop Word List (RHS) → Regex token → GroupBy token (R) → Count → Word Count]

• Enterprise Data Workflows
• Sample Code
• A Little Theory…
• Pattern
• PMML
• Roadmap
• Customer Experiments
3. Cascading – origins
API author Chris Wensel worked as a system architect
at an Enterprise firm well-known for many popular
data products.
Wensel was following the Nutch open source project –
where Hadoop started.
Observation: would be difficult to find Java developers
to write complex Enterprise apps in MapReduce –
potential blocker for leveraging new open source
technology.
4. Cascading – functional programming
Key insight: MapReduce is based on functional programming
– back to LISP in 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:
• leverages JVM and Java-based tools without any
need to create new languages
• allows programmers who have J2EE expertise
to leverage the economics of Hadoop clusters
5. functional programming… in production
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
have invested in open source projects atop Cascading
– used for their large-scale production deployments
• new case studies for Cascading apps are mostly
based on domain-specific languages (DSLs) in JVM
languages which emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
6. Cascading – definitions

• a pattern language for Enterprise Data Workflows
• simple to build, easy to test, robust in production
• design principles ⟹ ensure best practices at scale

[architecture diagram: Customers, Web App, Support logs → source tap (with trap tap) → Data Workflow on a Hadoop Cluster → sink taps → Modeling (PMML), Analytics Cubes, Customer Prefs, customer profile DBs, Cache, Reporting]
7. Cascading – usage

• Java API, DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL
• ASL 2 license, GitHub src, http://conjars.org
• 5+ yrs production use, multiple Enterprise verticals

[architecture diagram: Customers, Web App, Support logs → source tap (with trap tap) → Data Workflow on a Hadoop Cluster → sink taps → Modeling (PMML), Analytics Cubes, Customer Prefs, customer profile DBs, Cache, Reporting]
8. Cascading – integrations

• partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera
• taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc.
• serialization: Avro, Thrift, Kryo, JSON, etc.
• topologies: Apache Hadoop, tuple spaces, local mode

[architecture diagram: Customers, Web App, Support logs → source tap (with trap tap) → Data Workflow on a Hadoop Cluster → sink taps → Modeling (PMML), Analytics Cubes, Customer Prefs, customer profile DBs, Cache, Reporting]
9. Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy,
Williams-Sonoma, uSwitch, Airbnb, Nokia,
YieldBot, Square, Harvard, etc.
• use cases: ETL, marketing funnel, anti-fraud,
social media, retail pricing, search analytics,
recommenders, eCRM, utility grids, telecom,
genomics, climatology, agronomics, etc.
10. Cascading – deployments

• case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.
• use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc.

workflow abstraction addresses:
• staffing bottleneck;
• system integration;
• operational complexity;
• test-driven development
11. Pattern: predictive models at scale

[flow diagram: Document Collection → Scrub token / Tokenize (M) → HashJoin Left with Stop Word List (RHS) → Regex token → GroupBy token (R) → Count → Word Count]

• Enterprise Data Workflows
• Sample Code
• A Little Theory…
• Pattern
• PMML
• Roadmap
• Customer Experiments
12. The Ubiquitous Word Count

Definition: count how often each word appears in a collection of text documents

[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

This simple program provides an excellent test case for parallel processing, since it:
• requires a minimal amount of code
• demonstrates use of both symbolic and numeric values
• shows a dependency graph of tuples as an abstraction
• is not many steps away from useful search indexing
• serves as a “Hello World” for Hadoop apps

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));

Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.
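The map/reduce pseudocode above can be mimicked in a few lines of plain Python — a toy, single-process sketch of the map, shuffle, and reduce phases, not Cascading or Hadoop code:

```python
import re
from collections import defaultdict

def word_count(docs):
    """Toy single-process sketch of the Word Count pseudocode:
    map emits (token, 1) pairs, a shuffle groups them by key,
    and reduce sums each group."""
    emitted = []
    for doc_id, text in docs.items():                 # map phase
        for w in re.split(r"[ \[\](),.]+", text):
            if w:
                emitted.append((w, 1))
    groups = defaultdict(list)                        # shuffle: group by key
    for word, one in emitted:
        groups[word].append(one)
    return {w: sum(ones) for w, ones in groups.items()}  # reduce phase

print(word_count({"doc1": "rain shadow rain", "doc2": "rain"}))
# -> {'rain': 3, 'shadow': 1}
```

The same three phases are what the Cascading flow planner schedules as Hadoop job steps, with the shuffle handled by the framework.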
13. word count – conceptual flow diagram

[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

1 map, 1 reduce, 18 lines code
cascading.org/category/impatient
gist.github.com/3900702
14. word count – Cascading app in Java

[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" );

// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
15. word count – generated flow diagram

[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

[head]
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']
[{2}:'doc_id', 'text']
[{2}:'doc_id', 'text']
map
Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
[{1}:'token']
[{1}:'token']
GroupBy('wc')[by:['token']]
wc[{1}:'token']
[{1}:'token']
reduce
Every('wc')[Count[decl:'count']]
[{2}:'token', 'count']
[{1}:'token']
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']
[{2}:'token', 'count']
[{2}:'token', 'count']
[tail]
16. word count – Cascalog / Clojure

[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\](),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient
17. word count – Cascalog / Clojure

github.com/nathanmarz/cascalog/wiki

• implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development (TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
18. word count – Scalding / Scala

[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"),
      ('doc_id, 'text),
      skipHeader = true)
    .read
    .flatMap('text -> 'token) {
      text : String => text.split("[ \\[\\](),.]")
    }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}
19. word count – Scalding / Scala

github.com/twitter/scalding/wiki

• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram and function calls
• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog
20. word count – Scalding / Scala

github.com/twitter/scalding/wiki

• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram and function calls
• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog

Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping limit complexity in process
21. Two Avenues to the App Layer…

[chart axes: complexity ➞ vs. scale ➞]

Enterprise: must contend with complexity at scale everyday…
incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff

Start-ups: crave complexity and scale to become viable…
new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
22. Pattern: predictive models at scale

[flow diagram: Document Collection → Scrub token / Tokenize (M) → HashJoin Left with Stop Word List (RHS) → Regex token → GroupBy token (R) → Count → Word Count]

• Enterprise Data Workflows
• Sample Code
• A Little Theory…
• Pattern
• PMML
• Roadmap
• Customer Experiments
23. workflow abstraction – pattern language

Cascading uses a “plumbing” metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.

[flow diagram: Document Collection → Scrub token / Tokenize (M) → HashJoin Left with Stop Word List (RHS) → GroupBy token (R) → Count → Word Count]

Data is represented as flows of tuples. Operations within the flows bring functional programming aspects into Java.

In formal terms, this provides a pattern language.
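As a loose illustration of that plumbing metaphor (a hypothetical Python sketch, not the Cascading API), a workflow is just a tuple stream composed through a chain of operations:

```python
def pipe(stream, *ops):
    """Compose a stream through a chain of operations, echoing
    Cascading's pipes: each op is a pure function from stream to
    stream (filters, maps, and joins compose the same way)."""
    for op in ops:
        stream = op(stream)
    return list(stream)

# two stages of a toy flow: tokenize lines, then drop stop words
tokenize = lambda lines: (t for line in lines for t in line.split())
drop_stop = lambda toks: (t for t in toks if t not in {"the", "a"})

print(pipe(["the quick fox", "a lazy dog"], tokenize, drop_stop))
# -> ['quick', 'fox', 'lazy', 'dog']
```

Because each stage is a pure function over the stream, stages can be rearranged, tested in isolation, and planned onto a cluster topology — which is the point of the abstraction.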
24. references…
pattern language: a structured method for solving
large, complex design problems, where the syntax of
the language promotes the use of best practices
amazon.com/dp/0195019199
design patterns: the notion originated in consensus
negotiation for architecture, later applied in OOP
software engineering by “Gang of Four”
amazon.com/dp/0201633612
25. workflow abstraction – pattern language

Cascading uses a “plumbing” metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.

[flow diagram: Document Collection → Scrub token / Tokenize (M) → HashJoin Left with Stop Word List (RHS) → GroupBy token (R) → Count → Word Count]

design principles of the pattern language ensure best practices for robust, parallel data workflows at scale

Data is represented as flows of tuples. Operations within the flows bring functional programming aspects into Java.

In formal terms, this provides a pattern language.
26. workflow abstraction – literate programming

Cascading workflows generate their own visual documentation: flow diagrams

[flow diagram: Document Collection → Scrub token / Tokenize (M) → HashJoin Left with Stop Word List (RHS) → GroupBy token (R) → Count → Word Count]

In formal terms, flow diagrams leverage a methodology called literate programming.

Provides intuitive, visual representations for apps – great for cross-team collaboration.
27. references…
by Don Knuth
Literate Programming
Univ of Chicago Press, 1992
literateprogramming.com/
“Instead of imagining that our main task is
to instruct a computer what to do, let us
concentrate rather on explaining to human
beings what we want a computer to do.”
28. workflow abstraction – test-driven development

• assert patterns (regex) on the tuple streams
• adjust assert levels, like log4j levels
• trap edge cases as “data exceptions”
• TDD at scale:
  1. start from raw inputs in the flow graph
  2. define stream assertions for each stage of transforms
  3. verify exceptions, code to remove them
  4. when impl is complete, app has full test coverage

redirect traps in production to Ops, QA, Support, Audit, etc.

[architecture diagram: Customers, Web App, Support logs → source tap (with trap tap) → Data Workflow on a Hadoop Cluster → sink taps → Modeling (PMML), Analytics Cubes, Customer Prefs, customer profile DBs, Cache, Reporting]
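The trap-and-assert steps above might be sketched like this (a hypothetical Python stand-in for Cascading's stream assertions and trap taps):

```python
import re

def assert_stream(stream, pattern, trap):
    """Apply a regex assertion to each tuple value: matches pass
    through; edge cases are trapped as 'data exceptions' instead
    of failing the whole flow."""
    passed = []
    for value in stream:
        (passed if re.fullmatch(pattern, value) else trap).append(value)
    return passed

trap = []
ok = assert_stream(["1023", "n/a", "77"], r"\d+", trap)
# ok -> ['1023', '77'], trap -> ['n/a']: route the trap to Ops/QA/Support
```

Tightening the pattern and emptying the trap, stage by stage, is the TDD loop described on the slide.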
29. workflow abstraction – business process
Following the essence of literate programming, Cascading
workflows provide statements of business process
This recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
This is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
By virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale
30. references…
by Edgar Codd
“A relational model of data for large shared data banks”
Communications of the ACM, 1970
dl.acm.org/citation.cfm?id=362685
Rather than arguing between SQL vs. NoSQL…
structured vs. unstructured data frameworks…
this approach focuses on what apps do:
the process of structuring data
Closely related to functional relational programming paradigm:
“Out of the Tar Pit”
Moseley & Marks 2006
http://goo.gl/SKspn
31. workflow abstraction – API design principles
• specify what is required, not how it must be achieved
• plan far ahead, before consuming cluster resources –
fail fast prior to submit
• fail the same way twice – deterministic flow planners
help reduce engineering costs for debugging at scale
• same JAR, any scale – app does not require a recompile
to change data taps or cluster topologies
32. workflow abstraction – building apps in layers

business process: separation of concerns – focus on specifying what is required, not how the computers must accomplish it, not unlike BPM/BPEL for Big Data

test-driven development: assert expected patterns in tuple flows, adjust assertion levels, verify that tests fail, code until tests pass, repeat … route exceptional data to the appropriate department

pattern language: syntax of the pattern language conveys expertise – much like building a tower with Lego blocks: ensure best practices for robust, parallel data workflows at scale

flow planner/optimizer: enables the functional programming aspects – compiler within a compiler, mapping flows to topologies (e.g., create and sequence Hadoop job steps)

compiler/build: entire app is visible to the compiler – resolves issues of crossing boundaries for troubleshooting, exception handling, notifications, etc.; one app = one JAR

topology: Apache Hadoop MR, IMDGs, etc. – upcoming MR2, etc.

JVM cluster: cluster scheduler, instrumentation, etc.
33. workflow abstraction – building apps in layers

business process: separation of concerns – focus on specifying what is required, not how the computers must accomplish it, not unlike BPM/BPEL for Big Data

test-driven development: assert expected patterns in tuple flows, adjust assertion levels, verify that tests fail, code until tests pass, repeat … route exceptional data to the appropriate department

pattern language: syntax of the pattern language conveys expertise – much like building a tower with Lego blocks: ensure best practices for robust, parallel data workflows at scale

flow planner/optimizer: enables the functional programming aspects – compiler within a compiler, mapping flows to topologies

compiler/build: entire app is visible to the compiler – resolves issues of crossing boundaries for troubleshooting, exception handling, notifications, etc.; one app = one JAR

topology: Apache Hadoop MR, IMDGs, etc. – upcoming MR2, etc.

JVM cluster: cluster scheduler, instrumentation, etc.

several theoretical aspects converge into software engineering practices which minimize the complexity of building and maintaining Enterprise data workflows
34. Pattern: predictive models at scale

[flow diagram: Document Collection → Scrub token / Tokenize (M) → HashJoin Left with Stop Word List (RHS) → Regex token → GroupBy token (R) → Count → Word Count]

• Enterprise Data Workflows
• Sample Code
• A Little Theory…
• Pattern
• PMML
• Roadmap
• Customer Experiments
35. Pattern – analytics workflows
• open source project – ASL 2, GitHub repo
• multiple companies contributing
• complementary to Apache Mahout – while leveraging
workflow abstraction, multiple topologies, etc.
• model scoring: generates workflows from PMML models
• model creation: estimation at scale, captured as PMML
• use sample Hadoop app at scale – no coding required
• integrate with 2 lines of Java (1 line Clojure or Scala)
• excellent use cases for customer experiments at scale
cascading.org/pattern
36. Pattern – analytics workflows

• open source project – ASL 2, GitHub repo
• multiple companies contributing
• complementary to Apache Mahout – while leveraging workflow abstraction, multiple topologies, etc.
• model scoring: generates workflows from PMML models
• model creation: estimation at scale, captured as PMML
• use sample Hadoop app at scale – no coding required
• integrate with 2 lines of Java (1 line Clojure or Scala)
• excellent use cases for customer experiments at scale

greatly reduced development costs, less licensing issues at scale – leveraging the economics of Apache Hadoop clusters, plus the core competencies of analytics staff, plus existing IP in predictive models

cascading.org/pattern
37. Pattern – model scoring

• migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
• great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc.
• integrate with other libraries – Matrix API, etc.
• leverage PMML as another kind of DSL

[architecture diagram: Customers, Web App, Support logs → source tap (with trap tap) → Data Workflow on a Hadoop Cluster → sink taps → Modeling (PMML), Analytics Cubes, Customer Prefs, customer profile DBs, Cache, Reporting]

cascading.org/pattern
38. Pattern – an example classifier

1. use customer order history as the training data set
2. train a risk classifier for orders, using Random Forest
3. export model from R to PMML
4. build a Cascading app to execute the PMML model
   4.1. generate flow from PMML description
   4.2. plan the flow for a topology (Hadoop)
   4.3. compile app to a JAR file
5. verify results with a regression test
6. deploy the app at scale to calculate scores
7. potentially, reuse classifier for real-time scoring

[system diagram: risk classifiers (customer 360 and per-order dimensions) built as Cascading apps – data prep on training data sets, analyst's laptop exporting a PMML model, scoring new orders, anomaly detection, customer segmentation, velocity metrics; Hadoop batch workloads and IMDG real-time workloads over the Customer DB, with ETL from the DW, chargebacks, and partner data]
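Steps 4–7 can be caricatured in a few lines of Python — a hypothetical sketch, where `risk` stands in for the classifier that would be generated from the PMML description, and the trap mirrors Cascading's trap taps:

```python
def score_orders(rows, classify):
    """Score each order with the classifier; malformed records go
    to a trap rather than failing the whole batch job."""
    scored, trap = [], []
    for row in rows:
        try:
            scored.append({**row, "score": classify(row)})
        except (KeyError, ValueError) as exc:
            trap.append({**row, "error": str(exc)})
    return scored, trap

def risk(row):
    # hypothetical stand-in for the PMML-derived Random Forest
    return 1 if float(row["amount"]) > 100.0 else 0

rows = [{"order": "a", "amount": "250"},
        {"order": "b", "amount": "30"},
        {"order": "c", "amount": "n/a"}]   # malformed -> trap
scored, trap = score_orders(rows, risk)
```

In the real app the scored output feeds the regression test in step 5, while the trap output goes to Ops for inspection.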
39. Pattern – an example classifier

[system diagram, repeated: risk classifiers (customer 360 and per-order dimensions) as Cascading apps – data prep on training data sets, analyst's laptop exporting a PMML model, scoring new orders, anomaly detection, customer segmentation, velocity metrics; Hadoop batch workloads and IMDG real-time workloads over the Customer DB, with ETL from the DW, chargebacks, and partner data]
40. Pattern – create a model in R

## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)

## test the model on the holdout test set
print(fit$importance)
print(fit)
predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)

## export predicted labels to TSV
write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)

## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
42. Pattern – score a model, within an app

public class Main {
  public static void main( String[] args ) {
    String pmmlPath = args[ 0 ];
    String ordersPath = args[ 1 ];
    String classifyPath = args[ 2 ];
    String trapPath = args[ 3 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
    Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
    Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

    // define a "Classifier" model from PMML to evaluate the orders
    ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
    Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
      .addSource( classifyPipe, ordersTap )
      .addTrap( classifyPipe, trapTap )
      .addSink( classifyPipe, classifyTap );

    // write a DOT file and run the flow
    Flow classifyFlow = flowConnector.connect( flowDef );
    classifyFlow.writeDOT( "dot/classify.dot" );
    classifyFlow.complete();
  }
}
43. Pattern – score a model, using pre-defined Cascading app

[flow diagram: Customer Orders → Classify (PMML Model) → Assert (M) → GroupBy token → Count (R) → Scored Orders, with Failure Traps and a Confusion Matrix as additional outputs]
44. Pattern – score a model, using pre-defined Cascading app
## run an RF classifier at scale
hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
  --pmml data/sample.rf.xml

## run an RF classifier at scale, assert regression test, measure confusion matrix
hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
  --pmml data/sample.rf.xml --assert --measure out/measure

## run a predictive model at scale, measure RMSE
hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
  --pmml data/iris.lm_p.xml --rmse out/measure
46. Lingual – connecting Hadoop and R
# load the JDBC package
library(RJDBC)
# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
"~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")
# set up a database connection to a local repository
connection <- dbConnect(drv,
  "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")
# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
"SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)
# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)
library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
47. Lingual – connecting Hadoop and R
> summary(df$hire_age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.86 27.89 31.70 31.61 35.01 43.92
cascading.org/lingual
launchpad.net/test-db
48. Pattern: predictive models at scale
[word count flow diagram: Document Collection → Tokenize (Regex token) → Scrub token → HashJoin Left with Stop Word List (RHS) → GroupBy token → Count → Word Count; M/R marks the map/reduce boundary]
• Enterprise Data Workflows
• Sample Code
• A Little Theory…
• Pattern
• PMML
• Roadmap
• Customer Experiments
49. PMML – standard
• established XML standard for predictive model markup
• organized by Data Mining Group (DMG), since 1997
http://dmg.org/
• members: IBM, SAS, Visa, NASA, Equifax, Microstrategy,
Microsoft, etc.
• PMML concepts for metadata, ensembles, etc., translate
directly into Cascading tuple flows
“PMML is the leading standard for statistical and data mining models and
supported by over 20 vendors and organizations. With PMML, it is easy
to develop a model on one system using one application and deploy the
model on another system using another application.”
wikipedia.org/wiki/Predictive_Model_Markup_Language
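To make the standard concrete, here is an illustrative sketch (not from the talk, and abridged) of what a PMML document looks like: a data dictionary describing the fields, plus a model element carrying the fitted parameters. Field names and values here are hypothetical.

```xml
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="illustrative sketch only"/>
  <DataDictionary numberOfFields="2">
    <DataField name="var0" optype="continuous" dataType="double"/>
    <DataField name="label" optype="categorical" dataType="string"/>
  </DataDictionary>
  <RegressionModel modelName="sample" functionName="regression">
    <MiningSchema>
      <MiningField name="var0"/>
      <MiningField name="label" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="1.85">
      <NumericPredictor name="var0" coefficient="-1.38"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```

Because the model parameters travel as declarative XML rather than code, a consumer such as Pattern can parse the document and evaluate it inside a Cascading tuple flow.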
50. PMML – models
• Association Rules: AssociationModel element
• Cluster Models: ClusteringModel element
• Decision Trees: TreeModel element
• Naïve Bayes Classifiers: NaiveBayesModel element
• Neural Networks: NeuralNetwork element
• Regression: RegressionModel and GeneralRegressionModel elements
• Rulesets: RuleSetModel element
• Sequences: SequenceModel element
• Support Vector Machines: SupportVectorMachineModel element
• Text Models: TextModel element
• Time Series: TimeSeriesModel element
ibm.com/developerworks/industry/library/ind-PMML2/
52. Pattern: predictive models at scale
[word count flow diagram: Document Collection → Tokenize (Regex token) → Scrub token → HashJoin Left with Stop Word List (RHS) → GroupBy token → Count → Word Count; M/R marks the map/reduce boundary]
• Enterprise Data Workflows
• Sample Code
• A Little Theory…
• Pattern
• PMML
• Roadmap
• Customer Experiments
53. roadmap – existing algorithms for scoring
• Random Forest
• Decision Trees
• Linear Regression
• GLM
• Logistic Regression
• K-Means Clustering
• Hierarchical Clustering
• Support Vector Machines
cascading.org/pattern
54. roadmap – top priorities for creating models at scale
• Random Forest
• Logistic Regression
• K-Means Clustering
a wealth of recent research indicates many opportunities
to parallelize popular algorithms for training models at scale
on Apache Hadoop…
cascading.org/pattern
55. roadmap – next priorities for scoring
• Time Series (ARIMA forecast)
• Association Rules (basket analysis)
• Naïve Bayes
• Neural Networks
algorithms extended based on customer use cases –
contact @pacoid
cascading.org/pattern
56. Pattern: predictive models at scale
[word count flow diagram: Document Collection → Tokenize (Regex token) → Scrub token → HashJoin Left with Stop Word List (RHS) → GroupBy token → Count → Word Count; M/R marks the map/reduce boundary]
• Enterprise Data Workflows
• Sample Code
• A Little Theory…
• Pattern
• PMML
• Roadmap
• Customer Experiments
57. experiments – comparing models
• much customer interest in leveraging Cascading and
Apache Hadoop to run customer experiments at scale
• run multiple variants, then measure relative “lift”
• Concurrent runtime – tag and track models
the following example compares two models trained
with different machine learning algorithms.
The contrast is deliberately exaggerated: one model has
an important variable intentionally omitted, to help illustrate the experiment
58. experiments – Random Forest model
## train a Random Forest model
## example: http://mkseo.pe.kr/stats/?p=220
f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
print(fit)
saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))
OOB estimate of error rate: 14%
Confusion matrix:
0 1 class.error
0 69 16 0.1882353
1 12 103 0.1043478
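To make the `class.error` column concrete: each rate is the off-diagonal (misclassified) count divided by that row's total. A quick sketch, not part of the talk's code, that reproduces the arithmetic above:

```java
public class ClassError {
    // per-class error rate = misclassified count / row total
    static double classError(int correct, int wrong) {
        return (double) wrong / (correct + wrong);
    }

    public static void main(String[] args) {
        // row for class 0: 69 correct, 16 misclassified -> 16/85
        System.out.printf("class 0 error: %.7f%n", classError(69, 16));
        // row for class 1: 103 correct, 12 misclassified -> 12/115
        System.out.printf("class 1 error: %.7f%n", classError(103, 12));
    }
}
```

The weighted average of the two row errors, (16 + 12) / 200 = 14%, matches the OOB error estimate printed above.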
59. experiments – Logistic Regression model
## train a Logistic Regression model (special case of GLM)
## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
f <- as.formula("as.factor(label) ~ var0 + var2")
fit <- glm(f, family=binomial, data=data)
print(summary(fit))
saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.8524 0.3803 4.871 1.11e-06 ***
var0 -1.3755 0.4355 -3.159 0.00159 **
var2 -3.7742 0.5794 -6.514 7.30e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
NB: this model has “var1” intentionally omitted
60. experiments – comparing results
• use a confusion matrix to compare results for the classifiers
• Logistic Regression has a lower “false negative” rate (5% vs. 11%);
however, it has a much higher “false positive” rate (52% vs. 14%)
• assign a cost model to select a winner –
for example, in an ecommerce anti-fraud classifier:
FN ∼ chargeback risk
FP ∼ customer support costs
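As a sketch of how such a cost model might pick a winner: weight each error type by its unit cost and compare expected cost per order. The dollar costs and fraud base rate below are hypothetical assumptions, not figures from the talk; only the error rates come from the slides.

```java
public class CostModel {
    // expected cost per order: FN cost applies to actual positives missed,
    // FP cost applies to actual negatives flagged incorrectly
    static double expectedCost(double fnRate, double fpRate,
                               double fnCost, double fpCost, double baseRate) {
        return baseRate * fnRate * fnCost + (1 - baseRate) * fpRate * fpCost;
    }

    public static void main(String[] args) {
        double fnCost = 80.0;  // hypothetical chargeback risk per missed fraud
        double fpCost = 10.0;  // hypothetical support cost per false alarm
        double base   = 0.05;  // hypothetical fraud base rate

        // error rates quoted above: RF 11% FN / 14% FP, LR 5% FN / 52% FP
        double rf = expectedCost(0.11, 0.14, fnCost, fpCost, base);
        double lr = expectedCost(0.05, 0.52, fnCost, fpCost, base);
        System.out.printf("RF: %.4f  LR: %.4f%n", rf, lr);
        // under these assumed costs, Random Forest wins despite more FNs
    }
}
```

The point of the exercise is that neither error rate alone selects the winner; the business cost model does, and changing the assumed costs can flip the outcome.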
61. references…
Enterprise Data Workflows
with Cascading
O’Reilly, 2013
amazon.com/dp/1449358721
62. drill-down…
blog, dev community, code/wiki/gists, maven repo,
commercial products, career opportunities:
cascading.org
zest.to/group11
github.com/Cascading
conjars.org
goo.gl/KQtUL
concurrentinc.com
Copyright © 2013, Concurrent, Inc.