LinkedIn Endorsements: Reputation, Virality, and Social Tagging - Peter Skomoroch
Endorsements are a one-click system to recognize someone for their skills and expertise on LinkedIn, the largest professional online social network. This is one of the latest “data features” in LinkedIn’s portfolio, and the endorsement ecosystem generates a large graph of reputation signals and viral user activity.
In this talk, we’ll examine the practical aspects of building a data feature like Endorsements. We’ll talk about marrying product design and data, deep diving into several of the lessons we’ve learned along the way - all using skills & endorsements as an empirical case study. We’ll include technical detail on our approaches and how we combine crowdsourcing, machine learning, and large scale distributed systems to recommend topics to users.
Amazon EC2 may offer the possibility of high performance computing to programmers on a budget. Instead of building and maintaining a permanent Beowulf cluster, we can launch a cluster on-demand using Python and EC2. This talk will cover the basics involved in getting your own cluster running using Python, demonstrate how to run some large parallel computations using Python MPI wrappers, and show some initial results on cluster performance.
Building Scale Free Applications with Hadoop and Cascading - cwensel
Many more applications are suitable to be built on Apache Hadoop than many developers realize.
In this presentation, we hope to give attendees enough information on how Hadoop works, how MapReduce can be leveraged to perform common and well understood data processing operations, and how the Cascading open-source project helps developers rapidly build sophisticated Hadoop applications that can be simply tested locally and executed remotely.
Presented at OSBridge June 2009.
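As an illustration of the "common and well understood data processing operations" that the abstract says MapReduce can perform, here is a minimal word-count sketch in plain Python. It only mimics the map, shuffle, and reduce phases in a single process; it does not use the Hadoop or Cascading APIs, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order, in a single process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["Hadoop runs MapReduce", "Cascading runs on Hadoop"])
```

In a real cluster the map and reduce steps would run on different machines and the framework would perform the shuffle; the sketch keeps only the data flow.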
Open Data: From the Information Age to the Action Age (Keynote File) - Tim O'Reilly
This is the presentation I made at the UK Department for International Aid/Omidyar Network OpenUp! conference in London on November 13, 2012. I talk about open government not as a platform for transparency or citizen engagement, but for a developer ecosystem building useful services. A video of this talk is available at http://www.youtube.com/watch?feature=player_embedded&v=OIlxdpfu71o
My talk at the Stanford Technology Ventures Program on March 6, 2013. I talk about some technical and business lessons from Square, Uber, AirBnB, and the Google Autonomous Vehicle that are applicable to today's startups.
Localized methods for diffusions in large graphs - David Gleich
I describe a few ongoing research projects on diffusions in large graphs and how we can design matrix computations that determine them efficiently.
Mesosphere lightning talk presented at the first Mesos Townhall Meeting 2013-11-19 https://www.eventbrite.com/e/mesostownhall-meeting-1119-tickets-9104464699
Digital analytics & privacy: it's not the end of the world - OReillyStrata
This presentation starts by revisiting common best practices in digital analytics: measuring digital assets' effectiveness in order to increase conversion, common data feeds between tools, and possibly data flows between continents for analysis.
These practices are then put in parallel with legal requirements, showing which steps need to be undertaken to ensure legal compliance of said practices, how those responsible for digital activities should be trained in data protection matters, and what contracts are needed with both data providers and collectors to ensure minimal liability for these routinely undertaken tasks.
This presentation is NOT about security and goes beyond the over-blown cookie debate in order to highlight how the upcoming EU Personal Data Protection Regulation will influence digital analytics to hopefully start embracing Privacy by Design ways of working.
Tahseen Consulting’s Work on Knowledge-based Economies in the Arab World is Ci... - Wesley Schwalje
The United Nations University’s Maastricht Economic and Social Research Institute on Innovation and Technology cited Tahseen Consulting's Wes Schwalje's research on knowledge-based economies in analyzing knowledge transfer in the MENA countries.
Shared mobility, an innovation challenge within a global transport system - Pierre-Olivier Desmurs
Household budget constraints, urban congestion and pollution, the rise of sharing services ... all trends that, as early as 2012, were prompting the automotive industry to question the relevance of its model in the age of usage.
Nomadvise - 27 November 2012
I talk about the evolution of digital content into services, the role of sensors in the future of the web, about the idea of man-machine collaboration in internet services, and about the role of social networking in building content.
Terrorism, discrimination based on reservation, injustice, corruption, crime, unemployment, poverty, pollution - problems of India are multiplying every day, but there is nobody coming forward to offer a solution.
And we common Indians carry on our lives as usual hoping that something good will happen someday on its own.
Let us realize that nothing will change till we, the common people, wake up. To bring about this change, we need to clean the system by being a part of it and not as an outsider. Let us all rise to this occasion.
We admire A. P. J. Abdul Kalam, T. N. Shesan, Narayan Murthy, K. P. S. Gill, Kurien Verghese, E. Sridharan, Kiran Bedi, Joginder Singh, Dr. Jai Prakash Narayan, Arvind Kejriwal, Anna Hazare, Aruna Roy, Sundeep Pandey and similar other leaders for their selfless contribution to the nation.
We have launched this political organization called JAGO PARTY to initiate this cleaning action and to say boldly that enough is enough.
Intro to Data Science for Enterprise Big Data - Paco Nathan
If you need a different format (PDF, PPT) instead of Keynote, please email me: pnathan AT concurrentinc DOT com
An overview of Data Science for Enterprise Big Data. In other words, how to combine structured and unstructured data, leveraging the tools of automation and mathematics, for highly scalable businesses. We discuss management strategy for building Data Science teams, basic requirements of the "science" in Data Science, and typical data access patterns for working with Big Data. We review some great algorithms, tools, and truisms for building a Data Science practice, and provide some great references to read for further study.
Presented initially at the Enterprise Big Data meetup at Tata Consultancy Services, Santa Clara, 2012-08-20 http://www.meetup.com/Enterprise-Big-Data/events/77635202/
Enterprise Data Workflows with Cascading - Paco Nathan
Cascading meetup held jointly with Enterprise Big Data meetup at Tata Consultancy Services in Santa Clara on 2012-12-17
http://www.meetup.com/cascading/events/94079162/
A Data Scientist And A Log File Walk Into A Bar... - Paco Nathan
Presented at Splunk .conf 2012 in Las Vegas. Includes an overview of the Cascading app based on City of Palo Alto open data. PS: email me if you need a different format than Keynote: @pacoid or pnathan AT concurrentinc DOT com
Similar to Using Cascalog to build an app based on City of Palo Alto Open Data (11)
Human in the loop: a design pattern for managing teams working with ML - Paco Nathan
Strata CA 2018-03-08
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223
Although it has long been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but exceptions are referred to human experts, whose decisions help improve new iterations of the models.
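The routing step described above, where confident predictions stay automated and exceptions go to a human expert, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the talk: the function `route_predictions`, the 0.8 confidence threshold, and the toy `toy_predict` and `toy_expert` stand-ins are all hypothetical.

```python
from typing import Callable, List, Tuple


def route_predictions(
    items: List[str],
    predict: Callable[[str], Tuple[str, float]],
    ask_expert: Callable[[str], str],
    threshold: float = 0.8,  # assumed cut-off, chosen only for illustration
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Keep confident model labels; refer low-confidence items to an expert."""
    automated, referred = [], []
    for item in items:
        label, confidence = predict(item)
        if confidence >= threshold:
            automated.append((item, label))
        else:
            # The expert's answer becomes a new training example
            # for the next iteration of the model.
            referred.append((item, ask_expert(item)))
    return automated, referred


def toy_predict(text: str) -> Tuple[str, float]:
    """Hypothetical model: confident only when it sees the word 'python'."""
    if "python" in text:
        return ("programming", 0.95)
    return ("unknown", 0.40)


def toy_expert(text: str) -> str:
    """Hypothetical human judgement, hard-coded for the example."""
    return "biology" if "snake" in text else "other"


automated, referred = route_predictions(
    ["python tutorial", "snake habitats"], toy_predict, toy_expert
)
```

The `referred` list is the part that closes the loop: in an active learning setup those expert-labeled examples would be added to the training set before the model is retrained.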
Human-in-the-loop: a design pattern for managing teams that leverage ML - Paco Nathan
Strata Singapore 2017 session talk 2017-12-06
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/65611
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called active learning allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We’ll consider some of the technical aspects — including available open source projects — as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn’t it applicable?
* How do HITL approaches compare/contrast with more “typical” use of Big Data?
* What’s the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time:
* In what ways do the humans involved learn from the machines?
In particular, we’ll examine use cases at O’Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Human-in-a-loop: a design pattern for managing teams which leverage ML - Paco Nathan
Big Data Spain, 2017-11-16
https://www.bigdataspain.org/2017/talk/human-in-the-loop-a-design-pattern-for-managing-teams-which-leverage-ml
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We'll consider some of the technical aspects -- including available open source projects -- as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn't it applicable?
* How do HITL approaches compare/contrast with more "typical" use of Big Data?
* What's the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?
In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Humans in a loop: Jupyter notebooks as a front-end for AI - Paco Nathan
JupyterCon NY 2017-08-24
https://www.safaribooksonline.com/library/view/jupytercon-2017-/9781491985311/video313210.html
Paco Nathan reviews use cases where Jupyter provides a front-end to AI as the means for keeping "humans in the loop". This talk introduces *active learning* and the "human-in-the-loop" design pattern for managing how people and machines collaborate in AI workflows, including several case studies.
The talk also explores how O'Reilly Media leverages AI in Media, and in particular some of our use cases for active learning such as disambiguation in content discovery. We're using Jupyter as a way to manage active learning ML pipelines, where the machines generally run automated until they hit an edge case and refer the judgement back to human experts. In turn, the experts train the ML pipelines purely through examples, not feature engineering, model parameters, etc.
Jupyter notebooks serve as one part configuration file, one part data sample, one part structured log, one part data visualization tool. O'Reilly has released an open source project on GitHub called `nbtransom` which builds atop `nbformat` and `pandas` for our active learning use cases.
This work anticipates upcoming work on collaborative documents in JupyterLab, based on Google Drive. In other words, the machines and people become collaborators on shared documents.
Humans in the loop: AI in open source and industry - Paco Nathan
Nike Tech Talk, Portland, 2017-08-10
https://niketechtalks-aug2017.splashthat.com/
O'Reilly Media gets to see the forefront of trends in artificial intelligence: what the leading teams are working on, which use cases are getting the most traction, previews of advances before they get announced on stage. Through conferences, publishing, and training programs, we've been assembling resources for anyone who wants to learn. An excellent recent example: Generative Adversarial Networks for Beginners, by Jon Bruner.
This talk covers current trends in AI, industry use cases, and recent highlights from the AI Conf series presented by O'Reilly and Intel, plus related materials from Safari learning platform, Strata Data, Data Show, and the upcoming JupyterCon.
Along with reporting, we're leveraging AI in Media. This talk dives into O'Reilly uses of deep learning -- combined with ontology, graph algorithms, probabilistic data structures, and even some evolutionary software -- to help editors and customers alike accomplish more of what they need to do.
In particular, we'll show two open source projects in Python from O'Reilly's AI team:
• pytextrank built atop spaCy, NetworkX, datasketch, providing graph algorithms for advanced NLP and text analytics
• nbtransom leveraging Project Jupyter for a human-in-the-loop design pattern approach to AI work: people and machines collaborating on content annotation
Lessons learned from 3 (going on 4) generations of Jupyter use cases at O'Reilly Media. In particular, about "Oriole" tutorials which combine video with Jupyter notebooks, Docker containers, backed by services managed on a cluster by Marathon, Mesos, Redis, and Nginx.
https://conferences.oreilly.com/fluent/fl-ca/public/schedule/detail/62859
https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/62858
Strata UK 2017. Computable content leverages Jupyter notebooks to make learning materials more powerful by integrating compute engines, data sources, etc. O’Reilly Media extended this approach to create the new Oriole Online Tutorial medium, publishing notebooks from authors along with video timelines. (A free public tutorial, Regex Golf, by Peter Norvig demonstrates what’s possible with this technology integration.) Each user session launches a Docker container on a Mesos cluster for fully personalized compute environments. The UX is entirely browser based.
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates use of `TextBlob` and `SpaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
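To give a feel for the graph algorithm behind TextRank, here is a simplified sketch in plain Python. It is not PyTextRank's API and it omits the parsing, part-of-speech filtering, and keyphrase merging that the package performs: it only builds an undirected co-occurrence graph over tokens and ranks single words with a PageRank-style iteration. The function name `textrank_keywords` and its parameters are hypothetical.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50):
    """Rank tokens by PageRank over an undirected co-occurrence graph."""
    # Link every token to the tokens that follow it within the window.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Each node repeatedly receives a share of its neighbors' scores.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)


tokens = "graph algorithms rank graph nodes and graph edges".split()
ranked = textrank_keywords(tokens)
```

In this toy input the word "graph" co-occurs with every other word, so it ends up at the top of `ranked`. A fixed iteration count is used instead of a convergence check to keep the sketch short.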
Use of standards and related issues in predictive analytics - Paco Nathan
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
Presented 2015-08-24 at SF Bay ACM, held at the eBay south campus in San Jose.
http://meetup.com/SF-Bay-ACM/events/221693508/
Project Jupyter https://jupyter.org/ evolved from IPython notebooks, and now supports a wide variety of programming language back-ends. Notebooks have proven to be effective tools used in Data Science, providing convenient packages for what Don Knuth coined as "literate programming" in the 1980s: code plus exposition in markdown. Results of running the code appear in-line as interactive graphics -- all packaged as collaborative, web-based documents. Some have said that the introduction of cloud-based notebooks is nearly as large of a fundamental change in software practice as the introduction of spreadsheets.
O'Reilly Media has been considering the question, "What comes after books and video?" Or, as one might imagine more pointedly, what comes after Kindle? To that point we have collaborated with Project Jupyter to integrate notebooks into our content management process, allowing authors to generate articles, tutorials, reports, and other media products as notebooks that also incorporate video segments. Code dependencies are containerized using Docker, and all of the content gets managed in Git repositories. We have added another layer, an open source project called Thebe that provides a kind of "media player" for embedding the containerized notebooks into web pages.
GalvanizeU Seattle: Eleven Almost-Truisms About Data - Paco Nathan
http://www.meetup.com/Seattle-Data-Science/events/223445403/
Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study the technologies, the processes, and the people involved.
Microservices, containers, and machine learning - Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) get containerized and used to crawl and parse email archives. These produce JSON data sets, then we run machine learning on a Spark cluster to surface insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data — based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
https://www.eventbrite.com/e/talk-by-paco-nathan-graph-analytics-in-spark-tickets-17173189472
Big Brains meetup hosted by BloomReach, 2015-06-04
Case study / demo of a large-scale graph analytics project, leveraging GraphX in Apache Spark to surface insights about open source developer communities — based on data mining of their email forums. The project works with any Apache email archive, applying NLP and machine learning techniques to analyze message threads, then constructs a large graph. Graph analytics, based on concise Scala coding examples in Spark, surface themes and interactions within the community. Results are used as feedback for respective developer communities, such as leaderboards, etc. As an example, we will examine analysis of the Spark developer community itself.
QCon São Paulo: Real-Time Analytics with Spark Streaming - Paco Nathan
"Real-Time Analytics with Spark Streaming" presented at QCon São Paulo, 2015-03-26
http://qconsp.com/presentation/real-time-analytics-spark-streaming
This talk presents an overview of Spark and its history and applications, then focuses on the Spark Streaming component used for real-time analytics. We compare it with earlier frameworks such as MillWheel and Storm, and explore industry motivations for open-source micro-batch streaming at scale.
The talk will include demos for streaming apps that include machine-learning examples. We also consider public case studies of production deployments at scale.
We’ll review the use of open-source sketch algorithms and probabilistic data structures that get leveraged in streaming – for example, the trade-off of 4% error bounds on real-time metrics for two orders of magnitude reduction in required memory footprint of a Spark app.
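As one concrete example of the kind of probabilistic data structure mentioned above, here is a minimal Count-Min Sketch in plain Python. It is only an illustration of trading a bounded error for a fixed memory footprint; it is not taken from the talk or from Spark, the class name and the `width` and `depth` defaults are hypothetical, and the 4% figure quoted above is not derived from this code.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using a fixed-size table of counters."""

    def __init__(self, width=1024, depth=4):
        self.width = width  # counters per row: wider means fewer collisions
        self.depth = depth  # independent rows: deeper means tighter estimates
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        """Yield one (row, column) position per row, salting the hash by row."""
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        """Return an estimate that can overcount but never undercounts."""
        return min(self.table[row][col] for row, col in self._columns(item))


sketch = CountMinSketch()
for event in ["spark"] * 5 + ["storm"] * 2 + ["millwheel"]:
    sketch.add(event)
```

Memory stays at `width * depth` counters no matter how many distinct items arrive in the stream, which is the property that makes structures like this attractive for streaming metrics.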
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More - Paco Nathan
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 - Tobias Schneck
As AI technology pushes into IT, I wondered, as an “infrastructure container Kubernetes guy”, how this fancy AI technology gets managed from an infrastructure operations view. Is it possible to apply our lovely cloud native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to make it work on our own infrastructure from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insight into which approaches I already have working for real.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview - Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities, spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... - Ramesh Iyer
In today's fast-changing business world, companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Using Cascalog to build an app based on City of Palo Alto Open Data
1. “Using Cascalog to build
an app based on
City of Palo Alto Open Data”
Paco Nathan
Concurrent, Inc.
San Francisco, CA
@pacoid
Copyright @2013, Concurrent, Inc.
Monday, 28 January 13 1
2. This project began as a machine
learning workshop for a graduate
seminar at CMU West
Many thanks to:
Stuart Evans,
CMU Distinguished Service Professor
Jonathan Reichental,
City of Palo Alto CIO
We use Cascalog to develop
a Big Data workflow
Open Source:
github.com/Cascading/CoPA/wiki
3. Palo Alto is generally quite
a pleasant place
• temperate weather
• lots of parks, enormous trees
• great coffeehouses
• walkable downtown
• not particularly crowded
• friendly VCs (sort of)
On a nice summer day, who wants
to be stuck indoors on a phone call?
Instead, take it outside –
go for a walk
4. Surely, there must be
an app for that…
But wait, there isn’t?
So let’s build one!
source: Apple
6. 1. unstructured data about municipal infrastructure
(GIS data: trees, roads, parks)
✚
2. unstructured data about where people like to walk
(smartphone GPS logs)
✚
3. a wee bit o’ curated metadata
4. personalized recommendations:
“Find a shady spot on a summer day in which to walk
near downtown Palo Alto. While on a long conference call.
Sippin’ a latte or enjoying some fro-yo.”
7. “unstructured” vs. “structured” data
is actually quite a Big Debate
refer back to Edgar Codd 1969
to learn about the Relational Model
relational != SQL
but I digress…
8. Data Science work must focus on
the process of structuring data
which must occur long before the
large-scale joins, predictive models,
visualizations, etc.
So, the process of structuring data is
what we examine here:
i.e., how to build workflows
for Big Data
thank you Dr. Codd
“A relational model of data for large shared data banks”
dl.acm.org/citation.cfm?id=362685
9. references
by DJ Patil
Data Jujitsu
O’Reilly, 2012
amazon.com/dp/B008HMN5BE
Building Data Science Teams
O’Reilly, 2011
amazon.com/dp/B005O4U3ZE
10. references
by Leo Breiman
Statistical Modeling:
The Two Cultures
Statistical Science, 2001
bit.ly/eUTh9L
also check out RStudio:
rstudio.org/
rpubs.com/
11. Generally speaking, we could approach the matter of developing
an Open Data app through these steps:
• clean up the raw, unstructured data from CoPA download (ETL)
• before modeling, perform visualization and analysis in RStudio
• spend time on ideation and research for potential use cases
• iterate on business process for the app workflow
• integrate with use cases represented by the workflow taps
• apply best practices and TDD at scale
• …PROFIT!
source: South Park
12. In terms of actual process used in
Data Science, here’s how my teams
have worked:
discovery: help people ask the right questions
modeling: allow automation to place informed bets
integration: deliver products at scale to customers
apps: build smarts into product features
systems: keep infrastructure running, cost-effective
13. For the process used with this Open Data app,
we chose to use Cascalog
by Nathan Marz, Sam Ritchie, et al., 2010
a DSL in Clojure which implements
Datalog, backed by Cascading
Some aspects of CS theory:
• Functional Relational Programming
• mitigates Accidental Complexity
• has been compared with Codd 1969
github.com/nathanmarz/cascalog/wiki
14. Q:
Who uses Cascalog, other than Twitter?
A:
• Climate Corp (they’re hiring, ask for Crea)
• Factual
• Nokia Maps
• Harvard School of Public Health
• YieldBot (PDX)
• uSwitch (London)
• etc.
15. pro:
• 10:1 reduction in code volume compared to SQL
• most advanced uses of Cascading
• Leiningen build: simple, no surprises, in Clojure itself
• test-driven development (TDD) for Big Data
• fault-tolerant workflows which are simple to follow
• machine learning, map-reduce, etc., started in LISP
years ago anywho
con:
• learning curve, limited number of Clojure developers
• aggregators are the magic, those take effort to learn
16. Accidental Complexity:
Not O(N^2) complexity, but the costs of software
engineering at scale over time
What happens when you build recommenders,
then go work on other projects for six months?
What does it cost others to maintain your apps?
Cascalog allows for leveraging the same framework,
same code base, from Discovery phase through
to Systems phase
It focuses on the process of structuring data:
specify what you require, not how it must be achieved
Huge implications for software engineering
18. discovery
The City of Palo Alto recently began to support
Open Data to give the local community greater
visibility into how their city government operates
This effort is intended to encourage students,
entrepreneurs, local organizations, etc., to build
new apps which contribute to the public good
paloalto.opendata.junar.com/dashboards/7576/geographic-information/
19. discovery
GIS about trees in Palo Alto:
20. discovery
GIS about roads in Palo Alto:
21. discovery
Geographic_Information,,,
"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl","
Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs
Number: 203 Tree Site: 2 Species: Celtis australis
Source: davey tree Protected: Designated: Heritage:
Appraised Value: Hardscape: None Identifier: 40 Active
Numeric: 1 Location Feature ID: 13872 Provisional:
Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point"
"Wilkie Way from West Meadow Drive to Victoria Place"," Sequence:
20 Street_Name: Wilkie Way From Street PMMS: West Meadow
Drive To Street PMMS: Victoria Place Street ID: 598 (Wilkie
Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS:
567 Year Constructed: 1950 Traffic Count: 596 Traffic
Index: residential local Traffic Class: local residential
Traffic Date: 08/24/90 Paving Length: 208 Paving Width: 40
Paving Area: 8320 Surface Type: asphalt concrete Surface
Thickness: 2.0 Base Type Pvmt: crusher run base Base
Thickness: 6.0 Soil Class: 2 Soil Value: 15 Curb Type:
Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1
District Number: 18 Land Use PMMS: 1 Overlay Year: 1990
Overlay Thickness: 1.5 Base Failure Year: 1990 Base Failure
Thickness: 6 Surface Treatment Year: Surface Treatment
Type: Alligator Severity: none Alligator Extent: 0
Block Severity: none Block Extent: 0 Longitude and
Transverse Severity: none Longitude and Transverse Extent: 0
Ravelling Severity: none Ravelling Extent: 0 Ridability
Severity: none Trench Severity: none Trench Extent: 0
(um, bokay…)
22. discovery
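;; assumes the ns form pulls in cascalog.api (for <- and hfs-textline)
;; and aliases a CSV library that provides parse-csv as csv
(defn parse-gis
  "leverages parse-csv for complex CSV format in GIS export"
  [line]
  (first (csv/parse-csv line)))

(defn etl-gis
  "subquery to parse data sets from the GIS source tap"
  [gis trap]
  (<- [?blurb ?misc ?geo ?kind]
      (gis ?line)
      (parse-gis ?line :> ?blurb ?misc ?geo ?kind)
      ;; tuples that fail to parse are diverted to the trap tap
      (:trap (hfs-textline trap))))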
(specify what you require,
not how to achieve it…
addressing the 80%)
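The same parsing step can be sketched outside Cascalog in plain Python. This is only an illustration under stated assumptions, not the CoPA code: `parse_gis` and `etl_gis` are hypothetical names mirroring the Clojure functions, the standard-library `csv` module stands in for `parse-csv`, and a plain list stands in for the failure trap.

```python
import csv

def parse_gis(line):
    """Parse one quoted CSV line from the GIS export into its fields."""
    return next(csv.reader([line]))

def etl_gis(lines):
    """Split lines into parsed (blurb, misc, geo, kind) rows and trapped lines."""
    rows, trap = [], []
    for line in lines:
        try:
            fields = parse_gis(line)
        except csv.Error:
            fields = []
        if len(fields) == 4:
            blurb, misc, geo, kind = (f.strip() for f in fields)
            rows.append((blurb, misc, geo, kind))
        else:
            # anything that does not yield four fields goes to the trap
            trap.append(line)
    return rows, trap

# shortened version of the tree record shown on the previous slide
sample = ('"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl",'
          '" Private: -1 Tree ID: 29 Street_Name: ADDISON AV",'
          '"37.4409634615283,-122.15648458861,0.0 ","Point"')
rows, trap = etl_gis([sample, "not a GIS record"])
```

Here `rows` holds one four-field tuple and `trap` holds the rejected line, which is roughly the role the `:trap` tap plays in the query.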
23. discovery
(convert ad-hoc queries
into logical propositions)
24. discovery
Identifier: 474
Tree ID: 412
Tree: 412 site 1 at 115 HAWTHORNE AV
Tree Site: 1
Street_Name: HAWTHORNE AV
Situs Number: 115
Private: -1
Species: Liquidambar styraciflua
Source: davey tree
Hardscape: None
37.446001565119,-122.167713417554,0.0
Point
(obtain recognizable
results)
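A regex can pull fields like these out of the free-text `misc` column. The Python sketch below is an assumption about how such a pattern might look, not the pattern used in the app; `TREE_PATTERN` and `parse_tree` are hypothetical names, and the field order follows the raw export shown earlier.

```python
import re

# field order follows the raw export: Private, Tree ID, Street_Name,
# Situs Number, Tree Site, Species, Source
TREE_PATTERN = re.compile(
    r"Private:\s+(\S+)\s+Tree ID:\s+(\d+)\s+Street_Name:\s+(.+?)\s+"
    r"Situs Number:\s+(\d+)\s+Tree Site:\s+(\d+)\s+Species:\s+(.+?)\s+Source:"
)

def parse_tree(misc):
    """Return a dict of tree fields, or None when the text is not a tree record."""
    match = TREE_PATTERN.search(misc)
    if match is None:
        return None
    private, tree_id, street, situs, site, species = match.groups()
    return {
        "private": private,
        "tree_id": int(tree_id),
        "street_name": street,
        "situs_number": int(situs),
        "tree_site": int(site),
        "species": species,
    }

misc = ("Private: -1 Tree ID: 412 Street_Name: HAWTHORNE AV Situs Number: 115 "
        "Tree Site: 1 Species: Liquidambar styraciflua Source: davey tree")
tree = parse_tree(misc)
```

Records that do not match, such as road entries, return `None` and could be routed to a trap.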
25. discovery
(curate valuable metadata)
30. discovery
(flow diagram, gis tree; nodes: GIS export src, Regex parse-gis, Regex parse-tree, Scrub species, Join with Tree Metadata, Estimate height, Geohash, tree; Failure Traps)
31. definitions
The conceptual flow diagram shows a directed, acyclic graph (DAG)
of taps, tuple streams, functions, joins, aggregations, assertions, etc.
Cascading is formally a pattern language – patterns of “plumbing”
fit together to ensure best practices for large-scale parallel processing
in risk-averse environments – hard requirements of Enterprise IT
In other words, Cascading forces functional programming
through an API for JVM-based languages such as Java, Scala, Clojure
Through this approach, we define Enterprise Data Workflows
32. definitions
pattern language: a structured method for
solving large, complex design problems, where
the syntax of the language promotes the use
of best practices
amazon.com/dp/0195019199
design patterns: originated in consensus
negotiation for architecture, later used in
OOP software engineering
amazon.com/dp/0201633612
34. discovery
?blurb          Hawthorne Avenue from Alma Street to High Street
?traffic_count  3110
?traffic_class  local residential
?surface_type   asphalt concrete
?albedo         0.12
?min_lat        37.446140860599854
?min_lng        -122.1674652295435
?min_alt        0.0
?geohash        9q9jh0
(another data product)
35. discovery
The road data provides:
• traffic class (arterial, truck route, residential, etc.)
• traffic counts distribution
• surface type (asphalt, cement; age)
This leads to estimators for noise, reflection, etc.
36. discovery
(flow diagram, gis road; nodes: GIS export src, Regex parse-gis, Regex parse-road, Join with Road Metadata, Estimate Albedo, Road Segments, Geohash, road; Failure Traps)
38. modeling
GIS data from Palo Alto provides us with
geolocation about each item in the export:
latitude, longitude, altitude
Geo data is great for managing municipal
infrastructure as well as for mobile apps
Predictive modeling in our Open Data
example focuses on leveraging geolocation
We use spatial indexing by creating
a grid of geohash values, for efficient
parallel processing
Cascalog queries collect items with the
same geohash values – using them as keys
for large-scale joins (Hadoop)
39. modeling
geohash with 6-digit resolution
approximates a 5-block square
centered lat: 37.445, lng: -122.162
9q9jh0
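For readers unfamiliar with geohashes, here is a minimal Python sketch of the standard encoding: longitude and latitude are bisected alternately, and every five bits become one base-32 character. The function name `geohash_encode` is hypothetical; the app itself relies on a library rather than code like this.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lng, precision=6):
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lng = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2.0
            if lng >= mid:
                bits = (bits << 1) | 1
                lng_lo = mid
            else:
                bits = bits << 1
                lng_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2.0
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits = bits << 1
                lat_hi = mid
        use_lng = not use_lng
        bit_count += 1
        if bit_count == 5:  # five bits per base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

# the Hawthorne Avenue road segment and a Hawthorne Avenue tree from the export
road_cell = geohash_encode(37.446140860599854, -122.1674652295435)
tree_cell = geohash_encode(37.446001565119, -122.167713417554)
```

Both points land in the same six-character cell, which is what makes the geohash usable as a join key.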
40. modeling
Each road in the GIS export is listed as a block
between two cross roads, and each may have
multiple road segments to represent turns:
-122.161776959558,37.4518836690781,0.0
-122.161390381489,37.4516410983794,0.0
-122.160786011735,37.4512589903357,0.0
-122.160531178368,37.4510977281699,0.0
( lat1, lng1, alt1 )
( lat3, lng3, alt3 )
( lat0, lng0, alt0 )
( lat2, lng2, alt2 )
NB: segments in the raw GIS have the order
of geo coordinates scrambled: (lng, lat, alt)
41. modeling
Our app analyzes each road segment as a data tuple,
calculating the center point for each:
( lat, lng, alt )
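A small Python sketch of the center-point calculation follows. It assumes that each consecutive pair of points in a road entry forms one segment and that the center is the simple midpoint of its endpoints; the helper names are hypothetical. Note the swap from the raw (lng, lat, alt) order.

```python
def parse_point(text):
    """Parse a raw 'lng,lat,alt' triple and return it as (lat, lng, alt)."""
    lng, lat, alt = (float(value) for value in text.strip().split(","))
    return lat, lng, alt

def midpoint(p0, p1):
    """Component-wise midpoint of two (lat, lng, alt) points."""
    return tuple((a + b) / 2.0 for a, b in zip(p0, p1))

def segment_centers(raw_points):
    """Treat consecutive points as segments and return one center per segment."""
    points = [parse_point(p) for p in raw_points]
    return [midpoint(a, b) for a, b in zip(points, points[1:])]

# the four raw points listed for one road in the GIS export
raw = [
    "-122.161776959558,37.4518836690781,0.0",
    "-122.161390381489,37.4516410983794,0.0",
    "-122.160786011735,37.4512589903357,0.0",
    "-122.160531178368,37.4510977281699,0.0",
]
centers = segment_centers(raw)
```

Four points give three segments, so `centers` holds three (lat, lng, alt) tuples.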
42. modeling
Then uses a geohash to define a grid cell,
as a boundary (or “canopy”):
9q9jh0
43. modeling
Query to join a road segment tuple with all the trees
within its geohash boundary:
9q9jh0
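Conceptually this is a hash join keyed on the geohash. The Python sketch below shows the idea on a single machine; `join_on_geohash` is a hypothetical name, and the second road, the extra tree IDs, and the other cell names are made up for illustration. At scale, Cascalog and Hadoop perform the equivalent join in parallel.

```python
from collections import defaultdict

def join_on_geohash(roads, trees):
    """Pair every road segment with each tree that shares its geohash cell."""
    trees_by_cell = defaultdict(list)
    for tree in trees:
        trees_by_cell[tree["geohash"]].append(tree)
    return [
        (road, tree)
        for road in roads
        for tree in trees_by_cell.get(road["geohash"], [])
    ]

roads = [
    {"name": "Hawthorne Avenue from Alma Street to High Street", "geohash": "9q9jh0"},
    {"name": "illustrative road", "geohash": "9q9jh1"},   # made-up entry
]
trees = [
    {"tree_id": 412, "geohash": "9q9jh0"},
    {"tree_id": 999, "geohash": "9q9jh0"},    # made-up entry
    {"tree_id": 1000, "geohash": "9q9jh3"},   # made-up entry
]
pairs = join_on_geohash(roads, trees)
```

Only trees in the same cell as a road segment survive the join, so later distance checks run on a small candidate set.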
44. modeling
Use distance-to-midpoint to filter trees which are
too far away to provide shade:
X X
X
45. modeling
Calculate a sum of moments for tree height × distance
from road segment, as an estimator for shade:
∑( h·d )
We also calculate estimators for traffic frequency
and noise
46. modeling
(defn get-shade
  "subquery to join tree and road estimates, maximize for shade"
  [trees roads]
  (<- [?road_name ?geohash ?road_lat ?road_lng
       ?road_alt ?road_metric ?tree_metric]
      (roads ?road_name _ _ _
             ?albedo ?road_lat ?road_lng ?road_alt ?geohash
             ?traffic_count _ ?traffic_class _ _ _ _)
      (road-metric
        ?traffic_class ?traffic_count ?albedo :> ?road_metric)
      (trees _ _ _ _ _ _ _
             ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash)
      (read-string ?avg_height :> ?height)
      ;; limit to trees which are higher than people
      (> ?height 2.0)
      (tree-distance
        ?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance)
      ;; limit to trees within a one-block radius (not meters)
      (<= ?distance 25.0)
      (/ ?height ?distance :> ?tree_moment)
      (c/sum ?tree_moment :> ?sum_tree_moment)
      ;; magic number 200000.0 used to scale tree moment
      ;; based on median
      (/ ?sum_tree_moment 200000.0 :> ?tree_metric)
))
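The aggregation in this query can be restated as a short Python sketch. It mirrors the query as written (height divided by distance, thresholds 2.0 and 25.0, scale 200000.0) and takes distances as given, since `tree-distance` is not shown here. The function name `tree_metric` and the guard against zero distance are assumptions added for the sketch.

```python
def tree_metric(trees, min_height=2.0, max_distance=25.0, scale=200000.0):
    """Sum height/distance over qualifying trees, then scale the total.

    `trees` is an iterable of (height, distance) pairs; distance uses the
    same units as the query's tree-distance output (not meters).
    """
    total = 0.0
    for height, distance in trees:
        # keep trees taller than people and within the distance limit;
        # the distance > 0 check is an added guard against division by zero
        if height > min_height and 0.0 < distance <= max_distance:
            total += height / distance
    return total / scale

# illustrative (height, distance) pairs, not taken from the data set
candidates = [(10.0, 5.0), (1.5, 3.0), (20.0, 30.0), (8.0, 4.0)]
metric = tree_metric(candidates)
```

The short tree and the distant tree are filtered out, so only two of the four candidates contribute to the sum.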
47. modeling
?road_name    Hawthorne Avenue from Alma Street to High Street
?geohash      9q9jh0
?road_lat     37.446140860599854
?road_lng     -122.1674652295435
?road_alt     0.0
?road_metric  [1.0 0.5488121277250486 0.88]
?tree_metric  4.36321007861036
(another data product)
48. modeling
(flow diagram, shade; nodes: road, tree, Join, Calculate distance, Filter height, Filter distance, Sum moment, Filter sum_moment, Estimate traffic, shade)
54. modeling
(defn get-reco
  "subquery to recommend road segments based on GPS tracks"
  [tracks shades]
  (<- [?uuid ?road ?geohash ?lat ?lng ?alt
       ?gps_count ?recent_visit ?road_metric ?tree_metric]
      (tracks ?uuid ?geohash ?gps_count ?recent_visit)
      (shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric)))
(finally, the recommender)
55. modeling
Recommenders combine multiple signals,
generally via weighted averages, to rank
personalized results:
• GPS of person ∩ road segment
• frequency and recency of visit
• traffic class and rate
• road albedo (sunlight reflection)
• tree shade estimator
Adjusting the mix allows for further
personalization at the end use
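A weighted average over signals can be sketched in a few lines of Python. Everything here is hypothetical: the signal names, the normalization of each signal to the range 0 to 1, and the weights are illustrative choices, not values from the app.

```python
def score(signals, weights):
    """Weighted average of the signals named in `weights`."""
    total_weight = sum(weights.values())
    return sum(weights[name] * signals[name] for name in weights) / total_weight

def rank(candidates, weights):
    """Order candidate road segments from best to worst score."""
    return sorted(candidates, key=lambda c: score(c["signals"], weights), reverse=True)

# two illustrative road segments with signals already normalized to 0..1
candidates = [
    {"road": "A", "signals": {"shade": 0.9, "quiet": 0.2, "recency": 0.5}},
    {"road": "B", "signals": {"shade": 0.3, "quiet": 0.9, "recency": 0.5}},
]
shade_lover = {"shade": 3.0, "quiet": 1.0, "recency": 1.0}
quiet_lover = {"shade": 1.0, "quiet": 3.0, "recency": 1.0}

by_shade = [c["road"] for c in rank(candidates, shade_lover)]
by_quiet = [c["road"] for c in rank(candidates, quiet_lover)]
```

Changing only the weights reorders the same candidates, which is the sense in which adjusting the mix personalizes results.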
57. integration
Hadoop is rarely ever used in isolation
System integration is a hard problem in Big Data,
especially social aspects: breaking down silos
Cascading was built for this purpose:
• taps across many data frameworks:
HBase, Cassandra, MongoDB, etc.
• support for a variety of data serialization:
Avro, Thrift, Kryo, JSON, etc.
• planning on multiple topologies:
MapReduce, in-memory, tuple spaces, etc.
• test-driven development (TDD) at scale
• ANSI SQL-92 integration, PMML, etc.
58. integration
This example focuses on the batch workflow
to examine best practices for parallel processing
Integrating with a mobile app requires next steps:
• push “reco” output to a Redis cluster
(caching layer) via a Cascading tap
• leverage Redis “sorted sets” for ranking
personalized results
• create lightweight API in Node.js + Nginx
for low-latency access at scale
• collect social interactions in Splunk
• instrument via Nagios, New Relic, Flurry, etc.
That provides a data service – doesn’t even begin
to address: design, user experience, marketing,
implementation, etc., for a complete app…
59. integration
Batch workflow plus a data service:
(architecture diagram; labels: web logs, GIS export, gps tracks, Customer Prefs, customer profile DBs, source taps, Cascading app, Recommender, Hadoop cluster, trap tap, Support review, sink tap, Redis cluster, web app, mobile API, Customers, Splunk)
60. integration
In terms of deploying a batch workflow,
there are several considerations:
• build package for a “fat jar” (lein uberjar)
• continuous integration
• JAR repository
• cluster scheduling (e.g., EMR)
• instrumentation (Concurrent)
• troubleshooting from app layer
62. apps
We work on discovery, modeling, integration – long before
coding an app. In a linear-logical sense, one might prefer a “waterfall”
approach; however, that would undermine core values – mitigating
Accidental Complexity – TDD, scalability, fault-tolerance, etc.
In lieu of SQL queries, we define a composable set of logical
propositions which can be executed, instrumented, tested, etc.,
independently for best practices at scale in parallel
Back to functional relational programming, particularly Datalog’s
logic programming, we use subqueries as logical propositions…
within a functional context… to leverage the relational model
• scalability: specify what you require, not how
• testability: disprove the opposites of propositions, to validate
Taken together in the context of Cascalog, now let’s build the app…
65. apps
(results)
‣ addr: 115 HAWTHORNE AVE
‣ lat/lng: 37.446, -122.168
‣ geohash: 9q9jh0
‣ tree: 413 site 2
‣ species: Liquidambar styraciflua
‣ est. height: 23 m
‣ shade metric: 4.363
‣ traffic: local residential, light traffic
‣ recent visit: 1972376952532
‣ a short walk from my train stop ✔
66. apps
(flow diagram, for the whole enchilada; combines the tree, road, and shade flows with a gps flow (gps logs, Geohash, Count gps_count, Max recent_visit) joined into reco)
67. definitions
Design principles in the Cascading API pattern language,
which help ensure best practices for Big Data apps in
an Enterprise context:
• specify what is required, not how it must be achieved
• provide the “glue” for system integration
• same JAR, any scale
• users want no surprises
• fail the same way twice
• plan far ahead
These points echo arguments about functional relational
programming (FRP) and Accidental Complexity
from Moseley/Marks 2006
69. principle: same JAR, any scale
MegaCorp Enterprise IT:
PBs of data
1000+ node private cluster
EVP calls you when app fails
runtime: days+
Production Cluster:
TBs of data
EMR w/ many HPC Instances
Ops monitors results
runtime: hours – days
Staging Cluster:
GBs of data
EMR + a few Spot Instances
CI shows red or green lights
runtime: minutes – hours
Your Laptop:
MBs of data
Hadoop standalone mode
passes unit tests, or not
runtime: seconds – minutes
70. systems
#!/bin/bash -ex
# edit the `BUCKET` variable to use one of your S3 buckets:
BUCKET=temp.cascading.org/copa
SINK=out
# clear previous output (required by Apache Hadoop)
s3cmd del -r s3://$BUCKET/$SINK
# load built JAR + input data
s3cmd put target/copa.jar s3://$BUCKET/
s3cmd put -r data s3://$BUCKET/
# launch cluster and run
# (each trailing backslash continues the single elastic-mapreduce command)
elastic-mapreduce --create --name "CoPA" \
  --debug --enable-debugging --log-uri s3n://$BUCKET/logs \
  --jar s3n://$BUCKET/copa.jar \
  --arg s3n://$BUCKET/data/copa.csv \
  --arg s3n://$BUCKET/data/meta_tree.tsv \
  --arg s3n://$BUCKET/data/meta_road.tsv \
  --arg s3n://$BUCKET/data/gps.csv \
  --arg s3n://$BUCKET/$SINK/trap \
  --arg s3n://$BUCKET/$SINK/park \
  --arg s3n://$BUCKET/$SINK/tree \
  --arg s3n://$BUCKET/$SINK/road \
  --arg s3n://$BUCKET/$SINK/shade \
  --arg s3n://$BUCKET/$SINK/gps \
  --arg s3n://$BUCKET/$SINK/reco
74. Could combine this with a variety of data APIs:
• Trulia neighborhood data, housing prices
• Factual local business (FB Places, etc.)
• CommonCrawl open source full web crawl
• Wunderground local weather data
• WalkScore neighborhood data, walkability
• Data.gov US federal open data
• Data.NASA.gov NASA open data
• DBpedia datasets derived from Wikipedia
• GeoWordNet semantic knowledge base
• Geolytics demographics, GIS, etc.
• Foursquare, Yelp, CityGrid, Localeze, YP
• various photo sharing
75. Data Quality: some species names have
spelling errors or misclassifications – could
be cleaned up and provided back to CoPA
to improve municipal services
Assumptions have been made about
missing data – were these appropriate
for the intended use case?
There are better ways to handle spatial
indexing: k-d trees, etc.
The tree data product needs: photos,
toxicity, natives vs. invasives,
common names, etc.
76. Arguably, this is not a “large” data set:
• Palo Alto has 65K population
• great location for a POC
• prior to deploying in large metro areas
• CoPA is a leader in e-gov
• app is simpler to study on a laptop
Could extend to other cities with Open Data
initiatives:
SF, SJ, PDX, Seattle, VanBC…
Let’s get coverage for all of Ecotopia!
77. Trulia: optimize sales leads using estimated
allergy zones, based on buyers’ real estate
preferences
Calflora: report new observations of invasives,
endangered species, etc.; infer regions of affinity
for releasing beneficial insects
City of Palo Alto: assess zoning impact,
e.g., oleanders near day care centers; monitor
outbreaks of tree diseases (big impact on
property values)
start-ups: some invasive species are valuable
in Chinese medicine while others can be
converted to biodiesel – potential win-win
for targeted harvest services
78. summary points
• geo data is great for municipal infrastructure and for mobile apps
• Cascading as a pattern language for Enterprise Data Workflows
• design principles in the API/pattern language ensure best practices
• focus on the process of structuring data; not un/structured
• Cascalog subqueries as composable logical propositions
• FRP mitigates the engineering costs of Accidental Complexity
• Data Science process: discovery, modeling, integration, apps, systems
• Hadoop is rarely ever used in isolation; breaking down silos is the
hard problem, which must be socialized to resolve
80. references
by Paco Nathan
Enterprise Data Workflows
with Cascading
O’Reilly, 2013
amazon.com/dp/1449358721
Santa Clara, Feb 28, 1:30pm
strataconf.com/strata2013
81. drill-down
blog, code/wiki/gists, maven repo, community, products:
cascading.org
github.com/Cascading
conjars.org
meetup.com/cascading
goo.gl/KQtUL
concurrentinc.com
we are hiring!