Fully featured, commercially supported machine learning suites that can build Decision Trees in Hadoop are few and far between. Addressing this gap, Revolution Analytics recently enhanced its entire scalable analytics suite to run in Hadoop. In this talk, I will explain how our Decision Tree implementation exploits recent research reducing the computational complexity of decision tree estimation, allowing linear scalability with data size and number of nodes. This streaming algorithm processes data in chunks, allowing scaling unconstrained by aggregate cluster memory. The implementation supports both classification and regression and is fully integrated with the R statistical language and the rest of our advanced analytics and machine learning algorithms, as well as our interactive Decision Tree visualizer.
Enterprise Data Workflows with Cascading – Paco Nathan
Cascading meetup held jointly with Enterprise Big Data meetup at Tata Consultancy Services in Santa Clara on 2012-12-17
http://www.meetup.com/cascading/events/94079162/
10 Ways Your Blog Can Provide Real Value to You, Your Organization and Your B... – Rob Cottingham
This ebook is for anyone who's been told to cut the blog from their communications proposal...
...or who knows their social media activities could pull more of their own weight on the bottom line...
...or who wants to take their blog from the experimental stage to having real-world impact - and real-world value.
From the social media experts at Social Signal.
Intro to Data Science for Enterprise Big Data – Paco Nathan
If you need a different format (PDF, PPT) instead of Keynote, please email me: pnathan AT concurrentinc DOT com
An overview of Data Science for Enterprise Big Data. In other words, how to combine structured and unstructured data, leveraging the tools of automation and mathematics, for highly scalable businesses. We discuss management strategy for building Data Science teams, basic requirements of the "science" in Data Science, and typical data access patterns for working with Big Data. We review some great algorithms, tools, and truisms for building a Data Science practice, and provide some great references to read for further study.
Presented initially at the Enterprise Big Data meetup at Tata Consultancy Services, Santa Clara, 2012-08-20 http://www.meetup.com/Enterprise-Big-Data/events/77635202/
Presentation given at DocEng 2006, ACM Symposium on Document Engineering, October 2006
ABSTRACT: Citations form the basis for a web of scientific publications. Search engines, embedded hyperlinks, and digital libraries all simplify the task of finding publications of interest on the web and navigating to cited publications or web sites. However, the actual reading of publications often takes place on paper and frequently on the move. We present a system, Print-n-Link, that uses technologies for interactive paper to enhance the reading process by enabling users to access digital information and/or search for cited documents from a printed version of a publication, using a digital pen for interaction. A special virtual printer driver automatically generates links from paper to digital services during the printing process, based on an analysis of PDF documents. Depending on the user setting and interaction gesture, the system may retrieve metadata about the citation and inform the user through an audio channel or directly display the cited document on the user’s screen.
Human in the loop: a design pattern for managing teams working with ML – Paco Nathan
Strata CA 2018-03-08
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223
Although it has long been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but exceptions are referred to human experts, whose decisions help improve new iterations of the models.
Human-in-the-loop: a design pattern for managing teams that leverage ML – Paco Nathan
Strata Singapore 2017 session talk 2017-12-06
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/65611
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called active learning allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We’ll consider some of the technical aspects — including available open source projects — as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn’t it applicable?
* How do HITL approaches compare/contrast with more “typical” use of Big Data?
* What’s the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time:
* In what ways do the humans involved learn from the machines?
* In particular, we’ll examine use cases at O’Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Human-in-a-loop: a design pattern for managing teams which leverage ML – Paco Nathan
Big Data Spain, 2017-11-16
https://www.bigdataspain.org/2017/talk/human-in-the-loop-a-design-pattern-for-managing-teams-which-leverage-ml
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We'll consider some of the technical aspects -- including available open source projects -- as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn't it applicable?
* How do HITL approaches compare/contrast with more "typical" use of Big Data?
* What's the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?
In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
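As a rough sketch of the exception-referral loop described above (a minimal illustration in Python with scikit-learn, not code from the talk; `ask_human` and the 0.8 confidence threshold are hypothetical stand-ins for the human review step):

```python
# Minimal active-learning loop: the model handles confident predictions,
# low-confidence "exceptions" get referred to a human expert, and the
# model is retrained with the new label.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ask_human(item):
    # placeholder for the human-in-the-loop step, e.g. a review UI
    return int(input(f"label for {item}? "))

X_labeled = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]])
y_labeled = np.array([0, 1, 0, 1])
X_stream = [np.array([0.15, 0.10]), np.array([0.50, 0.55]), np.array([0.85, 0.90])]

model = LogisticRegression().fit(X_labeled, y_labeled)

for x in X_stream:
    proba = model.predict_proba(x.reshape(1, -1))[0]
    if proba.max() >= 0.8:
        print(x, "auto-labeled as", int(proba.argmax()))
    else:
        # exception: defer to the human expert, then fold the answer back in
        label = ask_human(x)
        X_labeled = np.vstack([X_labeled, x])
        y_labeled = np.append(y_labeled, label)
        model = LogisticRegression().fit(X_labeled, y_labeled)
```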
Humans in a loop: Jupyter notebooks as a front-end for AI – Paco Nathan
JupyterCon NY 2017-08-24
https://www.safaribooksonline.com/library/view/jupytercon-2017-/9781491985311/video313210.html
Paco Nathan reviews use cases where Jupyter provides a front-end to AI as the means for keeping "humans in the loop". This talk introduces *active learning* and the "human-in-the-loop" design pattern for managing how people and machines collaborate in AI workflows, including several case studies.
The talk also explores how O'Reilly Media leverages AI in Media, and in particular some of our use cases for active learning such as disambiguation in content discovery. We're using Jupyter as a way to manage active learning ML pipelines, where the machines generally run automated until they hit an edge case and refer the judgement back to human experts. In turn, the experts train the ML pipelines purely through examples, not feature engineering, model parameters, etc.
Jupyter notebooks serve as one part configuration file, one part data sample, one part structured log, one part data visualization tool. O'Reilly has released an open source project on GitHub called `nbtransom` which builds atop `nbformat` and `pandas` for our active learning use cases.
This work anticipates upcoming work on collaborative documents in JupyterLab, based on Google Drive. In other words, where the machines and people are collaborators on shared documents.
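As a rough illustration of that idea (not the `nbtransom` implementation itself), a notebook can be read with `nbformat` and treated as structured data with `pandas`; the file name here is a hypothetical example:

```python
# Read a notebook and tabulate its cells -- treating the .ipynb file
# as part configuration, part data sample, part structured log.
import nbformat
import pandas as pd

nb = nbformat.read("annotations.ipynb", as_version=4)  # hypothetical file

cells = pd.DataFrame([
    {"cell_type": cell.cell_type, "n_lines": len(cell.source.splitlines())}
    for cell in nb.cells
])

print(cells.groupby("cell_type")["n_lines"].sum())
```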
Humans in the loop: AI in open source and industry – Paco Nathan
Nike Tech Talk, Portland, 2017-08-10
https://niketechtalks-aug2017.splashthat.com/
O'Reilly Media gets to see the forefront of trends in artificial intelligence: what the leading teams are working on, which use cases are getting the most traction, previews of advances before they get announced on stage. Through conferences, publishing, and training programs, we've been assembling resources for anyone who wants to learn. An excellent recent example: Generative Adversarial Networks for Beginners, by Jon Bruner.
This talk covers current trends in AI, industry use cases, and recent highlights from the AI Conf series presented by O'Reilly and Intel, plus related materials from Safari learning platform, Strata Data, Data Show, and the upcoming JupyterCon.
Along with reporting, we're leveraging AI in Media. This talk dives into O'Reilly uses of deep learning -- combined with ontology, graph algorithms, probabilistic data structures, and even some evolutionary software -- to help editors and customers alike accomplish more of what they need to do.
In particular, we'll show two open source projects in Python from O'Reilly's AI team:
• pytextrank built atop spaCy, NetworkX, datasketch, providing graph algorithms for advanced NLP and text analytics
• nbtransom leveraging Project Jupyter for a human-in-the-loop design pattern approach to AI work: people and machines collaborating on content annotation
Lessons learned from 3 (going on 4) generations of Jupyter use cases at O'Reilly Media. In particular, about "Oriole" tutorials which combine video with Jupyter notebooks, Docker containers, backed by services managed on a cluster by Marathon, Mesos, Redis, and Nginx.
https://conferences.oreilly.com/fluent/fl-ca/public/schedule/detail/62859
https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/62858
Strata UK 2017. Computable content leverages Jupyter notebooks to make learning materials more powerful by integrating compute engines, data sources, etc. O’Reilly Media extended this approach to create the new Oriole Online Tutorial medium, publishing notebooks from authors along with video timelines. (A free public tutorial, Regex Golf, by Peter Norvig demonstrates what’s possible with this technology integration.) Each user session launches a Docker container on a Mesos cluster for fully personalized compute environments. The UX is entirely browser based.
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates use of `TextBlob` and `SpaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
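For a sense of what keyphrase extraction looks like in practice, here is a minimal usage sketch with a recent release of `pytextrank` as a spaCy pipeline component; this API post-dates the 2017 talk, so treat it as an approximation rather than the code presented there:

```python
import spacy
import pytextrank  # registers the "textrank" spaCy pipeline component

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

doc = nlp("Compatibility of systems of linear constraints over the set of natural numbers.")

# ranked keyphrases, highest TextRank score first
for phrase in doc._.phrases[:5]:
    print(round(phrase.rank, 4), phrase.text)
```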
Use of standards and related issues in predictive analytics – Paco Nathan
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
Presented 2015-08-24 at SF Bay ACM, held at the eBay south campus in San Jose.
http://meetup.com/SF-Bay-ACM/events/221693508/
Project Jupyter https://jupyter.org/ evolved from IPython notebooks, and now supports a wide variety of programming language back-ends. Notebooks have proven to be effective tools used in Data Science, providing convenient packages for what Don Knuth coined as "literate programming" in the 1980s: code plus exposition in markdown. Results of running the code appear in-line as interactive graphics -- all packaged as collaborative, web-based documents. Some have said that the introduction of cloud-based notebooks is nearly as large of a fundamental change in software practice as the introduction of spreadsheets.
O'Reilly Media has been considering the question, "What comes after books and video?" Or, as one might imagine more pointedly, what comes after Kindle? To that point we have collaborated with Project Jupyter to integrate notebooks into our content management process, allowing authors to generate articles, tutorials, reports, and other media products as notebooks that also incorporate video segments. Code dependencies are containerized using Docker, and all of the content gets managed in Git repositories. We have added another layer, an open source project called Thebe that provides a kind of "media player" for embedding the containerized notebooks into web pages.
GalvanizeU Seattle: Eleven Almost-Truisms About Data – Paco Nathan
http://www.meetup.com/Seattle-Data-Science/events/223445403/
Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study the technologies, the processes, and the people involved.
Microservices, containers, and machine learning – Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) get containerized and used to crawl and parse email archives. These produce JSON data sets, then we run machine learning on a Spark cluster to find insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data — based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
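As a rough sketch of the kind of Spark SQL roll-up step described above (a minimal example, not the Exsto code; the JSON path and field names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("email-etl-sketch").getOrCreate()

# parsed email messages, one JSON object per line (hypothetical schema)
msgs = spark.read.json("data/messages.json")

# simple roll-up: who posts most often on each topic
top_posters = (
    msgs.groupBy("topic", "sender")
        .agg(F.count("*").alias("n_posts"))
        .orderBy(F.desc("n_posts"))
)
top_posters.show(10)
```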
https://www.eventbrite.com/e/talk-by-paco-nathan-graph-analytics-in-spark-tickets-17173189472
Big Brains meetup hosted by BloomReach, 2015-06-04
Case study / demo of a large-scale graph analytics project, leveraging GraphX in Apache Spark to surface insights about open source developer communities — based on data mining of their email forums. The project works with any Apache email archive, applying NLP and machine learning techniques to analyze message threads, then constructs a large graph. Graph analytics, based on concise Scala coding examples in Spark, surface themes and interactions within the community. Results are used as feedback for respective developer communities, such as leaderboards, etc. As an example, we will examine analysis of the Spark developer community itself.
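A toy stand-in for that reply-graph analysis, using NetworkX in Python rather than the GraphX/Scala code shown in the talk, just to illustrate the shape of the computation; the edges below are made up:

```python
import networkx as nx

# toy "who replies to whom" graph from an email forum (made-up edges)
replies = [("alice", "bob"), ("carol", "alice"), ("dave", "alice"),
           ("bob", "carol"), ("erin", "bob")]

g = nx.DiGraph()
g.add_edges_from(replies)

# PageRank as a rough proxy for influence within the community
for person, score in sorted(nx.pagerank(g).items(), key=lambda kv: -kv[1]):
    print(f"{person}: {score:.3f}")
```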
QCon São Paulo: Real-Time Analytics with Spark Streaming – Paco Nathan
"Real-Time Analytics with Spark Streaming" presented at QCon São Paulo, 2015-03-26
http://qconsp.com/presentation/real-time-analytics-spark-streaming
This talk presents an overview of Spark and its history and applications, then focuses on the Spark Streaming component used for real-time analytics. We compare it with earlier frameworks such as MillWheel and Storm, and explore industry motivations for open-source micro-batch streaming at scale.
The talk will include demos for streaming apps that include machine-learning examples. We also consider public case studies of production deployments at scale.
We’ll review the use of open-source sketch algorithms and probabilistic data structures that get leveraged in streaming – for example, the trade-off of 4% error bounds on real-time metrics for two orders of magnitude reduction in required memory footprint of a Spark app.
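As an illustration of that memory/accuracy trade-off (a standalone sketch using the Python `datasketch` library, not the Spark code from the talk):

```python
from datasketch import HyperLogLog

# approximate distinct-user count with a small, fixed memory footprint
hll = HyperLogLog(p=12)  # 2^12 registers, roughly a few percent error

for i in range(200_000):
    user_id = f"user-{i % 50_000}"          # 50,000 distinct users
    hll.update(user_id.encode("utf8"))

print("approx. distinct users:", int(hll.count()))   # ~50,000
# an exact answer would require keeping all 50,000 ids in memory
```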
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More – Paco Nathan
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 – Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Securing your Kubernetes cluster: a step-by-step guide to success! – KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Essentials of Automations: Optimizing FME Workflows with Parameters – Safe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview – Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf – 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
The Art of the Pitch: WordPress Relationships and Sales – Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Elevating Tactical DDD Patterns Through Object Calisthenics – Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... – DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
GraphRAG is All You need? LLM & Knowledge Graph – Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... – Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 – Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an “infrastructure container Kubernetes guy”, how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our beloved cloud native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premises strategy we may need to apply AI to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working in practice.
JMeter webinar - integration with InfluxDB and Grafana – RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... – UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
1. “The Workflow Abstraction”
Strata SC
2013-02-28
Paco Nathan
Concurrent, Inc.
San Francisco, CA
@pacoid
Copyright @2013, Concurrent, Inc.
Background: dual expertise in quantitative and distributed systems.
I’ve spent the past decade leading innovative Data teams responsible for many successful large-scale apps.
2. The Workflow Abstraction
[Flow diagram: Document Collection → Scrub / Tokenize → HashJoin (Left) with Stop Word List (RHS) → GroupBy token → Count → Word Count]
1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines
This talk is about the workflow abstraction:
* the business process of structuring data
* the practices of building robust apps at scale
* the open source projects for Enterprise Data Workflows
We’ll consider some theory, examples, best practices, trendlines --
what are the drivers that brought us here, and where is this work heading?
Most of all, make it easy for people from all kinds of backgrounds to build Enterprise Data Workflows -- robust apps at scale -- for Hadoop and beyond.
3. Marketing Funnel – overview
In reference to Making Data Work…
Almost every business uses a model similar to this – give or take a few steps.
Customer leads go in at the top, those get refined through several stages, then results flow out the bottom.
[Funnel diagram: Customers → Campaigns → Awareness → Interest → Evaluation → Conversion → Referral → Repeat]
Let’s consider one of the most fundamental predictive models used in business: a marketing funnel.
This is an exercise which I’ve had to run through at nearly every firm in recent years -- analytics for the marketing funnel.
4. Marketing Funnel – clickstream
Different funnel stages get represented in ecommerce by events captured in log files, as a class of machine data called clickstream:
• ad impressions
• URL clicks
• landing page views
• new user registrations
• session cookies
• online purchases
• social network activity
• etc.
[Funnel diagram annotated with clickstream events: Awareness – Impression, Interest – Click, Evaluation – Sign Up, Conversion – Purchase, Referral – "Like"]
Online advertising involves what we call “clickstream” data, lots of events in log files -- i.e., lots of unstructured data.
5. Marketing Funnel – metrics
A variety of clickstream metrics can be used as performance indicators at different stages of the funnel:
• CPM: cost per thousand
• CTR: click-through rate
• CPA: cost per action
• etc.
[Funnel diagram annotated with metrics: Awareness/Impression – CPM, Interest/Click – CTR, Evaluation/Sign Up – behaviors, Conversion/Purchase – CPA, Referral – NPS, social graph, etc., Repeat – loyalty, win back, etc.]
The many different highly-nuanced metrics which apply are mind-boggling :)
6. Marketing Funnel – example calculations
[Funnel diagram: Customers → Campaigns → Awareness → Interest → Evaluation → Conversion → Referral → Repeat]
metric | cost   | events           | formula                | rate
CPM    | $4,000 | 10^6 impressions | $4,000 ÷ (10^6 ÷ 10^3) | $4.00
CTR    | –      | 3∙10^3 clicks    | 3∙10^3 ÷ 10^6          | 0.3%
CPA    | –      | 20 conversions   | $4,000 ÷ 20            | $200
Here are examples of the kinds of calculations performed...
7. Marketing Funnel – predictive model
Given these metrics, we can go further to estimate cost per paying user (CPP), customer lifetime value (LTV), etc.
Then we can build a predictive model for return on investment (ROI) per customer, summarizing the funnel performance:
ROI = (LTV − CPP) ∕ CPP
As an example, after crunching lots of logs, suppose that…
CPP = $200
LTV = $2000
ROI = ($2000 − $200) ∕ $200
for a 9x multiple
[Funnel diagram: Customers → Campaigns → Awareness → Interest → Evaluation → Conversion → Referral → Repeat]
For applications within a business, we can use these calculated metrics to create a predictive model for the profitability of customers,
which describes the efficiency of the marketing funnel at different stages.
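To make the arithmetic on the two slides above concrete, here is a small Python sketch (an editorial illustration, not part of the original deck) that reproduces the CPM, CTR, CPA, and ROI figures:

```python
# Worked example of the funnel metrics from the slides above.
ad_spend    = 4000.0      # campaign cost in dollars
impressions = 1_000_000   # ad impressions served
clicks      = 3_000       # URL clicks
conversions = 20          # paying customers acquired

cpm = ad_spend / (impressions / 1000)   # cost per thousand impressions -> $4.00
ctr = clicks / impressions              # click-through rate            -> 0.3%
cpa = ad_spend / conversions            # cost per action (CPP here)    -> $200

ltv = 2000.0                            # assumed customer lifetime value
roi = (ltv - cpa) / cpa                 # -> 9.0, i.e. a 9x multiple

print(f"CPM ${cpm:.2f}  CTR {ctr:.1%}  CPA ${cpa:.0f}  ROI {roi:.0f}x")
```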
8. Marketing Funnel – example architecture
Let’s consider an example architecture for calculating, reporting, and taking action on funnel metrics, based on large-scale clickstream data…
[Architecture diagram components: Customers, Web App, Support, logs, Cache, source/sink/trap taps, Data Modeling Workflow (PMML), Analytics Cubes, Customer Prefs, customer profile DBs, Hadoop Cluster, Reporting]
Here’s an example architecture of using clickstream metrics within an online business.
9. Marketing Funnel – complexities
Multiple ad partners, different contract terms, reporting different metrics at different times, click scrubs, etc.
Campaigns target specific geo/demo, test alternate landing pages, probably need to segment customer base…
These issues make clickstream data large and yet sparse.
Other issues:
• seasonal variation
• fluctuating currency exchange rates
• distortions due to credit card fraud
• diminishing returns
• forecasting requirements
[Funnel diagram: Customers → Campaigns → Awareness → Interest → Evaluation → Conversion → Referral → Repeat, with several stages marked ×]
However, real life intercedes. In many businesses, this is a complicated model to calculate correctly.
scrubs
many vendors, data sources, different metrics to be aligned
lots of roll-ups
Bayesian point estimates
forecasts and dashboards
social dimension makes this convoluted
not simple
10. Marketing Funnel – very large scale
Even a small start-up may need to make decisions about billions of events, many millions of users, and millions of dollars in annual ad spend.
Ad networks attempt to simplify and optimize parts of the funnel process as a value-add.
The need for these insights has been a driver for Hadoop-related technologies.
[Funnel diagram: Customers → Campaigns → Awareness → Interest → Evaluation → Conversion → Referral → Repeat, annotated with CPM, CTR, behaviors, CPA, etc.]
The needs for large scale funnel modeling and optimization have been drivers for MapReduce, Hadoop, and related “Big Data” technologies.
11. Marketing Funnel – very large scale
Even a small start-up may need to make decisions about billions of events, many millions of users, and millions of dollars in annual ad spend.
Ad networks attempt to simplify and optimize parts of the funnel process as a value-add.
The need for these insights has been a driver for Hadoop-related technologies.
funnel modeling and optimization requires complex data workflows to obtain the required insights
[Funnel diagram: Customers → Campaigns → Awareness → Interest → Evaluation → Conversion → Referral → Repeat]
These needs imply complex data workflows.
It’s not about doing a BI query or a pivot table;
that’s how retailers were thinking when Amazon came along.
12. The Workflow Abstraction
[Flow diagram: Document Collection → Scrub / Tokenize → HashJoin (Left) with Stop Word List (RHS) → GroupBy token → Count → Word Count]
1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines
A personal history of ad networks, Apache Hadoop apps, and Enterprise data workflows, circa 2008.
13. Circa 2008 – Hadoop at scale
Scenario: Analytics team at a large ad network…
Company had invested $MM capex in a large data warehouse across LOBs
Mission-critical app had been written as a large SQL workflow in the DW
Marketing funnel metrics were estimated for many advertisers, many campaigns, many publishers, many customers – billions of calculations daily
Predictive models matched publisher ~ advertiser and campaign ~ user, to optimize marketing funnel performance
[Diagram: clickstream → query/load → RDBMS, feeding roll-ups, collab filter, and per-user recommends; funnel stages Customers → Campaigns → Awareness → Interest → Evaluation → Conversion → Referral → Repeat]
Experience with a large marketing funnel optimization problem, as Director of Analytics at an ad network..
Most of the revenue depended on one app, written in a DW -- monolithic SQL which nobody at the company understood.
14. Circa 2008 – Hadoop at scale
Issues:
• critical app had hit hard limits for scalability
• several Tb data, 100’s of servers
• batch window length vs. failure rate vs. SLA, in the context of business growth, posed an existential risk
We built out a team to address these issues as rapidly as possible…
Needed to re-create that data workflow based on Enterprise requirements.
[Diagram: clickstream → query/load → RDBMS, feeding roll-ups, collab filter, and per-user recommends, marked ×]
Marching orders:
5 weeks to build a Data Science team of 10 (mostly Stats PhDs and DevOps) in Kansas City;
5 weeks to reverse engineer the mission-critical app without any access to its author;
5 weeks to implement a Hadoop version which could scale-out on EC2.
We had a great team, the members of which have moved on to senior roles at Apple, Facebook, Merkle, Quantcast, IMVU, etc.
15. Circa 2008 – Hadoop at scale
Approach:
• reverse-engineered business process from ~1500 lines of undocumented SQL
• created a large, multi-step Apache Hadoop app on AWS
• leveraged cloud strategy to trade $MM capex for lower, scalable opex
• Amazon identified our app as one of the largest Hadoop deployments on EC2
• our app became a case study for AWS prior to Elastic MapReduce launch
[Diagram: clickstream → query/load → RDBMS and msg queue, feeding HDFS, roll-ups, collab filter, and per-user recommends]
Our solution involved dependencies among more than a dozen Hadoop job steps.
16. Circa 2008 – Hadoop at scale
Unresolved:
• ETL was still a separate app
• difficult to handle exceptions, notifications, debugging, etc., across the entire workflow
• data scientists wore beepers since Ops lacked visibility into business process
• coding directly in MapReduce created a staffing bottleneck
[Diagram: clickstream → query/load → RDBMS and msg queue, feeding HDFS, roll-ups, collab filter, and per-user recommends, with pain points marked ×]
This underscores the need for a unified space for the entire data workflow, visible to the compiler and JVM --
for troubleshooting, handling exceptions, notifications, etc.
Otherwise, for apps at scale, Ops will give up and force the data scientists to wear beepers 24/7, which is almost never a good idea.
Three issues about Enterprise workflows:
* staffing bottleneck unless there’s a good abstraction layer
* operational complexity, mostly due to lack of transparency
* system integration problems *are* the main problem to solve
17. Circa 2008 – Hadoop at scale
Unresolved:
• ETL was still a separate app
• difficult to handle exceptions, notifications, debugging, etc., across the entire workflow
• data scientists wore beepers since Ops lacked visibility into business process
• coding directly in MapReduce created a staffing bottleneck
a good solution for a large, commercial Apache Hadoop deployment, but the app’s workflow management lacked crucial features…
which led to a search for a better workflow abstraction
[Diagram: clickstream → query/load → RDBMS and msg queue, feeding HDFS, roll-ups, collab filter, and per-user recommends]
While leading this team, I sought out other ways of managing a complex workflow involving Hadoop.
I found out about the Cascading open source project, and called the API author. Oddly enough, as I was walking into the interview for my next job, we passed each other in the parking lot.
18. The Workflow Abstraction
[Flow diagram: Document Collection → Scrub / Tokenize → HashJoin (Left) with Stop Word List (RHS) → GroupBy token → Count → Word Count]
1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines
Origin and overview of Cascading API as a workflow abstraction for Enterprise Big Data apps.
19. Cascading – origins
API author Chris Wensel worked as a system architect at an Enterprise firm well-known for several popular data products.
Wensel was following the Nutch open source project – before Hadoop even had a name.
He noted that it would become difficult to find Java developers to write complex Enterprise apps directly in Apache Hadoop – a potential blocker for leveraging this new open source technology.
Cascading initially grew from interaction with the Nutch project, before Hadoop had a name
API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context, without an abstraction layer.
20. Cascading – functional programming
Key insight: MapReduce is based on functional programming – back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows:
• leverages JVM and Java-based tools without a need to create an entirely new language
• allows many programmers who have J2EE expertise to build apps that leverage the economics of Hadoop clusters
Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.
21. quotes…
“Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset … Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.”
CIO, Thor Olavsrud, 2012-06-06
cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading
“Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics … A vast improvement over native MapReduce functions or Pig UDFs.”
2012 BOSSIE Awards, James Borck, 2012-09-18
infoworld.com/slideshow/65089
Industry analysts are picking up on the staffing costs related to Hadoop, “no free lunch”
The issues:
* staffing bottleneck
* operational complexity
* system integration
22. Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.
• partners: Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC, SpringSource, Cloudera
• 5+ year history of Enterprise production deployments, ASL 2 license, GitHub src, http://conjars.org
• use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, genomics, climatology, etc.
Several published case studies about Cascading, Cascalog, Scalding, etc.
Wide range of use cases.
Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading.
Partnerships with the various Hadoop distro vendors, cloud providers, etc.
23. examples…
• Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested in functional programming open source projects atop Cascading – used for their large-scale production deployments
• new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:
Cascalog in Clojure (2010) – github.com/nathanmarz/cascalog/wiki
Scalding in Scala (2012) – github.com/twitter/scalding/wiki
Many case studies, many Enterprise production deployments now for 5+ years.
24. examples…
• Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested in functional programming open source projects atop Cascading – used for their large-scale production deployments
• new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:
Cascalog in Clojure (2010) – github.com/nathanmarz/cascalog/wiki
Scalding in Scala (2012) – github.com/twitter/scalding/wiki
Cascading as the basis for workflow abstractions atop Hadoop and more, with a 5+ year history of production deployments across multiple verticals
Cascading as a basis for workflow abstraction, for Enterprise data workflows
25. The Workflow Abstraction
[Flow diagram: Document Collection → Scrub / Tokenize → HashJoin (Left) with Stop Word List (RHS) → GroupBy token → Count → Word Count]
1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines
Code samples in Cascading / Cascalog / Scalding, based on Word Count
26. The Ubiquitous Word Count
Definition: count how often each word appears in a collection of text documents.
[Flow diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]
This simple program provides an excellent test case for parallel processing, since it:
• requires a minimal amount of code
• demonstrates use of both symbolic and numeric values
• shows a dependency graph of tuples as an abstraction
• is not many steps away from useful search indexing
• serves as a “Hello World” for Hadoop apps
Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));
Taking a wild guess, most people who’ve written any MapReduce code have seen this example app already...
Due to my close ties to Freemasonry, I’m obligated to speak about WordCount at this point.
27. word count – conceptual flow diagram
[Flow diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]
1 map
1 reduce
18 lines code
cascading.org/category/impatient
gist.github.com/3900702
Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
28. word count – Cascading app in Java
[Flow diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
Based on a Cascading implementation of Word Count, here is sample code --
approx 1/3 the code size of the Word Count example from Apache Hadoop
2nd-to-last line: generates a DOT file for the flow diagram (which can be rendered with Graphviz, e.g., dot -Tpng dot/wc.dot)
29. word count – generated flow diagram
[flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]

generated flow plan (node labels from the DOT file):
[head]
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']   fields: [{2}:'doc_id', 'text']
map phase:
  Each('token')[RegexSplitGenerator[decl:'token'][args:1]]   fields: [{1}:'token']
  GroupBy('wc')[by:['token']]   fields: wc[{1}:'token']
reduce phase:
  Every('wc')[Count[decl:'count']]   fields: [{2}:'token', 'count']
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']   fields: [{2}:'token', 'count']
[tail]
Friday, 01 March 13 29
As a concrete example of literate programming in Cascading,
here is the DOT representation of the flow plan -- generated by the app itself.
30. word count – Cascalog / Clojure
[flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient
Friday, 01 March 13 30
Here is the same Word Count app written in Clojure, using Cascalog.
31. word count – Cascalog / Clojure
github.com/nathanmarz/cascalog/wiki

[flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]
• implements Datalog in Clojure, with predicates backed
by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL –
approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development
(TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs –
Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
Friday, 01 March 13 31
From what we see about language features, customer case studies, and best practices in general --
Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments.
Great for large-scale, complex apps, where small teams must limit the complexities in their process.
32. word count – Scalding / Scala
[flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]

import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"),
      ('doc_id, 'text),
      skipHeader = true)
    .read
    .flatMap('text -> 'token) {
      text : String => text.split("[ \\[\\]\\(\\),.]")
    }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}
Friday, 01 March 13 32
Here is the same Word Count app written in Scala, using Scalding.
Very compact, easy to understand; however, also more imperative than Cascalog.
33. word count – Scalding / Scala
github.com/twitter/scalding/wiki

[flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]
• extends the Scala collections API so that distributed lists
become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram
and function calls
• extensive libraries are available for linear algebra, abstract
algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less of a learning curve than Cascalog, though not as high-level a language
Friday, 01 March 13 33
If you wanted to see what a data services architecture for machine learning work at, say, Google scale would look like as an open source project -- that’s Scalding. That’s what they’re doing.
34. word count – Scalding / Scala
github.com/twitter/scalding/wiki

[flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]

• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram and function calls
• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale (imagine SOA infra @ Google as an open source project)
• less of a learning curve than Cascalog, though not as high-level a language

Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping to limit complexity in process
Friday, 01 March 13 34
Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…
35. The Workflow Abstraction
[flow diagram: Document Collection → Tokenize → Scrub token → HashJoin (Left) with Stop Word List (RHS) → GroupBy token → Count → Word Count]
1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines
Friday, 01 March 13 35
Tracking back to the Marketing Funnel as an example workflow…
Let’s consider how Cascading apps incorporate other components beyond Hadoop
36. Enterprise Data Workflows
Back to our marketing funnel, let’s consider an example app… at the front end:
LOB use cases drive demand for apps

[architecture diagram: Customers, Web App, logs, Cache, Logs, Support; a Data Workflow with source/sink/trap taps on the Hadoop Cluster, connecting Modeling (PMML), Analytics Cubes, Customer Prefs, customer profile DBs, and Reporting]
Friday, 01 March 13 36
LOB use cases drive the demand for Big Data apps
37. Enterprise Data Workflows
An example… in the back office:
Organizations have substantial investments in people, infrastructure, process

[architecture diagram: same Enterprise data workflow — Customers, Web App, logs, Cache, Logs, Support; Data Workflow with source/sink/trap taps, Modeling (PMML), Analytics Cubes, Customer Prefs, customer profile DBs, Hadoop Cluster, Reporting]
Friday, 01 March 13 37
Enterprise organizations have seriously ginormous investments in existing back office practices:
people, infrastructure, processes
38. Enterprise Data Workflows
An example… for the heavy lifting!
“Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out

[architecture diagram: same Enterprise data workflow as on the previous two slides]
Friday, 01 March 13 38
“Main Street” firms have invested in Hadoop to address Big Data needs, offsetting their rising costs for Enterprise licenses from SAS, Teradata, etc.
39. Cascading workflows – taps
• taps integrate other data frameworks, as tuple streams
• these are “plumbing” endpoints in the pattern language
• sources (inputs), sinks (outputs), traps (exceptions)
• text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
• data serialization: Avro, Thrift, Kryo, JSON, etc.
• extend a new kind of tap in just a few lines of Java

schema and provenance get derived from analysis of the taps

[architecture diagram: Enterprise data workflow, with the source, sink, and trap taps highlighted]
Friday, 01 March 13 39
Speaking of system integration,
taps provide the simplest approach for integrating different frameworks.
40. Cascading workflows – taps
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps -- callout: source and sink taps for TSV data in HDFS
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
Friday, 01 March 13 40
Here are the taps in the WordCount source
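As a sketch only (not from the deck), a trap tap could be wired into the same FlowDef to capture tuples that throw exceptions, rather than failing the whole flow; this assumes the FlowDef#addTrap() call available in Cascading 2.x, and the trap path is illustrative.

// sketch: capture failed tuples from the "token" branch into a trap tap (illustrative path)
Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), "output/trap" );

FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTrap( docPipe, trapTap )
  .addTailSink( wcPipe, wcTap );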
41. Cascading workflows – topologies
• topologies execute workflows on clusters
• flow planner is like a compiler for queries
  - Hadoop (MapReduce jobs)
  - local mode (dev/test or special config)
  - in-memory data grids (real-time)
• flow planner can be extended to support other topologies

blend flows in different topologies into the same app – for example, batch (Hadoop) + transactions (IMDG)

[architecture diagram: Enterprise data workflow, highlighting the Hadoop Cluster and other execution topologies]
Friday, 01 March 13 41
Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.
42. Cascading workflows – topologies
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
// callout: flow planner for the Apache Hadoop topology
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
Friday, 01 March 13 42
Here is the flow planner for Hadoop in the WordCount source
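For contrast, here is a minimal sketch (not from the deck) of planning the same pipe assembly with the local-mode flow planner for dev/test, assuming the cascading.flow.local, cascading.tap.local, and cascading.scheme.local packages in Cascading 2.x.

import cascading.flow.local.LocalFlowConnector;
import cascading.scheme.local.TextDelimited;
import cascading.tap.local.FileTap;

// sketch: same pipe assembly, planned for local mode instead of a Hadoop cluster
Tap docTap = new FileTap( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new FileTap( new TextDelimited( true, "\t" ), wcPath );

LocalFlowConnector flowConnector = new LocalFlowConnector( properties );
Flow wcFlow = flowConnector.connect( flowDef );  // same FlowDef, different flow planner
wcFlow.complete();

The taps change along with the planner, since Hfs taps are specific to the Hadoop topology.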
43. example topologies…
Friday, 01 March 13 43
Here are some examples of topologies for distributed computing --
Apache Hadoop being the first supported by Cascading,
followed by local mode, and now a tuple space (IMDG) flow planner in the works.
Several other widely used platforms would also be likely suspects for Cascading flow planners.