A talk given in Austin, Texas on 2013-05-28 about how cognitive bias interferes with leveraging distributed systems for large-scale apps. Also, about design patterns for Enterprise data workflows. http://hadoop-and-beyond-austin.eventbrite.com/
Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.
Video: http://www.youtube.com/watch?v=BT8WvQMMaV0
Hadoop is the technology of choice for processing large data sets. At salesforce.com, we service internal and product big data use cases using a combination of Hadoop, Java MapReduce, Pig, Force.com, and machine learning algorithms. In this webinar, we will discuss an internal use case and a product use case:
Product Metrics: Internally, we measure feature usage using a combination of Hadoop, Pig, and the Force.com platform (Custom Objects and Analytics).
Community-Based Recommendations: In Chatter, our most successful people and file recommendations are built on a collaborative filtering algorithm that is implemented on Hadoop using Java MapReduce.
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production - Chetan Khatri
Scala Toronto July 2019 event at 500px.
Pure Functional API Integration
Apache Spark Internals tuning
Performance tuning
Query execution plan optimisation
Cats Effects for switching execution model runtime.
Discovery / experience with Monix, Scala Future.
R is an in-memory scripting language capable of handling big data: tens of gigabytes and hundreds of millions of rows. Combined with SAP HANA, R offers the potential to take in-memory analytics to a whole new level. Imagine performing advanced statistical analysis such as decision trees, game theory, and linear and multiple regression inside SAP HANA on millions of rows, and turning around critical business insights at the speed of thought.
The Other Way of Doing Big Data: Declarative, Decoupled, Federated, Simple, and Resilient.
Also known as: How to Win at Scale and its Influence of People. Originally presented by Flip Kromer to the Research Board, http://www.researchboard.com/ June 2012
Neustar is a fast-growing provider of enterprise services in telecommunications, online advertising, Internet infrastructure, and advanced technology. Neustar has engaged Think Big Analytics to leverage Hadoop to expand their data analysis capacity. This session describes how Hadoop has expanded their data warehouse capacity and agility for data analysis, reduced costs, and enabled new data products. We look at the challenges and opportunities in capturing 100s of TBs of compact binary network data, ad hoc analysis, integration with a scale-out relational database, more agile data development, and building new products that integrate multiple big data sets.
UPDATED VERSION (2011): http://www.slideshare.net/plamere/music-recommendation-and-discovery
As the world of online music grows, music 2.0 recommendation systems become an increasingly important way for music listeners to discover new music.
Commercial recommenders such as Last.fm and Pandora have enjoyed commercial and critical success. But how well do these systems really work? How good are the recommendations? How far into The Long Tail do these recommenders reach?
In this tutorial we look at the current state of the art in music recommendation. We examine current commercial and research systems, focusing on the advantages and disadvantages of the various recommendation strategies. We look at some of the challenges in building music recommenders, and we explore some of the ways that MIR techniques can be used to improve future recommenders.
A Data Scientist And A Log File Walk Into A Bar... - Paco Nathan
Presented at Splunk .conf 2012 in Las Vegas. Includes an overview of the Cascading app based on City of Palo Alto open data. PS: email me if you need a different format than Keynote: @pacoid or pnathan AT concurrentinc DOT com
Intro to Data Science for Enterprise Big Data - Paco Nathan
If you need a different format (PDF, PPT) instead of Keynote, please email me: pnathan AT concurrentinc DOT com
An overview of Data Science for Enterprise Big Data. In other words, how to combine structured and unstructured data, leveraging the tools of automation and mathematics, for highly scalable businesses. We discuss management strategy for building Data Science teams, basic requirements of the "science" in Data Science, and typical data access patterns for working with Big Data. We review some great algorithms, tools, and truisms for building a Data Science practice, plus some great references to read for further study.
Presented initially at the Enterprise Big Data meetup at Tata Consultancy Services, Santa Clara, 2012-08-20 http://www.meetup.com/Enterprise-Big-Data/events/77635202/
Enterprise Data Workflows with Cascading - Paco Nathan
Cascading meetup held jointly with Enterprise Big Data meetup at Tata Consultancy Services in Santa Clara on 2012-12-17
http://www.meetup.com/cascading/events/94079162/
Keyword Services Platform (KSP) from Microsoft adCenter - goodfriday
Come learn how the KSP will revolutionize the search industry by allowing advertisers and developers to build KSP applications using public APIs. This session describes KSP and how it works. It also includes partners who will describe the unique power that this tool offers to advertisers.
The Comprehensive Approach: A Unified Information Architecture - Inside Analysis
The Briefing Room with Richard Hackathorn and Teradata
Slides from the Live Webcast on May 29, 2012
The worlds of Business Intelligence (BI) and Big Data Analytics can seem at odds, but only because we have yet to fully experience a comprehensive approach to managing big data: a Unified Big Data Architecture. The dynamics continue to change as vendors begin to emphasize the importance of leveraging SQL, engineering, and operational skills, as well as incorporating novel uses of MapReduce to improve distributed analytic processing.
Register for this episode of The Briefing Room to learn the value of taking a strategic approach for managing big data from veteran BI and data warehouse consultant Richard Hackathorn. He'll be briefed by Chris Twogood of Teradata, who will outline his company's recent advances in bridging the gap between Hadoop and SQL to unlock deeper insights and explain the role of Teradata Aster and SQL-MapReduce as a Discovery Platform for Hadoop environments.
For more information visit: http://www.insideanalysis.com
Watch us on YouTube: http://www.youtube.com/playlist?list=PL5EE76E2EEEC8CF9E
Front-Ending the Web with Microsoft Office - goodfriday
Come learn how to make your Web service instantly recognizable to over 400 million people worldwide. Hear how Microsoft Office has evolved to enable developers to extend the world's most widely used productivity suite with services from the Web.
Trending use cases have pointed out the complementary nature of Hadoop and existing data management systems—emphasizing the importance of leveraging SQL, engineering, and operational skills, as well as incorporating novel uses of MapReduce to improve distributed analytic processing. Many vendors have provided interfaces between SQL systems and Hadoop but have not been able to semantically integrate these technologies while Hive, Pig and SQL processing islands proliferate. This session will discuss how Teradata is working with Hortonworks to optimize the use of Hadoop within the Teradata Analytical Ecosystem to ingest, store, and refine new data types, as well as exciting new developments to bridge the gap between Hadoop and SQL to unlock deeper insights from data in Hadoop. The use of Teradata Aster as a tightly integrated SQL-MapReduce® Discovery Platform for Hadoop environments will also be discussed.
Human in the loop: a design pattern for managing teams working with ML - Paco Nathan
Strata CA 2018-03-08
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223
Although it has long been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but exceptions are referred to human experts, whose decisions help improve new iterations of the models.
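That exception-routing loop is easy to picture in code. Below is a minimal sketch (not from the talk; the item fields, the confidence score, and the `expert` callback are all made up for illustration) of how an active-learning pipeline might auto-accept confident predictions and refer the rest to a human expert:

```python
def model_confidence(item):
    # Toy stand-in for an ML model's confidence score (hypothetical field).
    return item["score"]

def route(items, expert, threshold=0.8):
    """Auto-accept confident predictions; refer the rest to a human expert.

    In a real pipeline, the expert's judgements would also be collected
    as labeled examples for training the next iteration of the model.
    """
    decisions = []
    for item in items:
        if model_confidence(item) >= threshold:
            decisions.append((item["id"], "auto", item["label"]))
        else:
            # Exception: defer this edge case to the human expert.
            decisions.append((item["id"], "human", expert(item)))
    return decisions

items = [
    {"id": 1, "score": 0.95, "label": "spam"},
    {"id": 2, "score": 0.55, "label": "spam"},  # low confidence -> human
]
print(route(items, expert=lambda item: "not-spam"))
```

The key design point is that the threshold controls how much work reaches the humans: raise it and more judgements are deferred, lower it and the pipeline runs more autonomously.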
Human-in-the-loop: a design pattern for managing teams that leverage ML - Paco Nathan
Strata Singapore 2017 session talk 2017-12-06
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/65611
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called active learning allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We’ll consider some of the technical aspects — including available open source projects — as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn’t it applicable?
* How do HITL approaches compare/contrast with more “typical” use of Big Data?
* What’s the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?
In particular, we’ll examine use cases at O’Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Human-in-a-loop: a design pattern for managing teams which leverage ML - Paco Nathan
Human-in-a-loop: a design pattern for managing teams which leverage ML
Big Data Spain, 2017-11-16
https://www.bigdataspain.org/2017/talk/human-in-the-loop-a-design-pattern-for-managing-teams-which-leverage-ml
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We'll consider some of the technical aspects -- including available open source projects -- as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn't it applicable?
* How do HITL approaches compare/contrast with more "typical" use of Big Data?
* What's the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?
In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Humans in a loop: Jupyter notebooks as a front-end for AI - Paco Nathan
JupyterCon NY 2017-08-24
https://www.safaribooksonline.com/library/view/jupytercon-2017-/9781491985311/video313210.html
Paco Nathan reviews use cases where Jupyter provides a front-end to AI as the means for keeping "humans in the loop". This talk introduces *active learning* and the "human-in-the-loop" design pattern for managing how people and machines collaborate in AI workflows, including several case studies.
The talk also explores how O'Reilly Media leverages AI in Media, and in particular some of our use cases for active learning such as disambiguation in content discovery. We're using Jupyter as a way to manage active learning ML pipelines, where the machines generally run automated until they hit an edge case and refer the judgement back to human experts. In turn, the experts train the ML pipelines purely through examples, not feature engineering, model parameters, etc.
Jupyter notebooks serve as one part configuration file, one part data sample, one part structured log, one part data visualization tool. O'Reilly has released an open source project on GitHub called `nbtransom` which builds atop `nbformat` and `pandas` for our active learning use cases.
This work anticipates upcoming work on collaborative documents in JupyterLab, based on Google Drive. In other words, where the machines and people are collaborators on shared documents.
Humans in the loop: AI in open source and industry - Paco Nathan
Nike Tech Talk, Portland, 2017-08-10
https://niketechtalks-aug2017.splashthat.com/
O'Reilly Media gets to see the forefront of trends in artificial intelligence: what the leading teams are working on, which use cases are getting the most traction, previews of advances before they get announced on stage. Through conferences, publishing, and training programs, we've been assembling resources for anyone who wants to learn. An excellent recent example: Generative Adversarial Networks for Beginners, by Jon Bruner.
This talk covers current trends in AI, industry use cases, and recent highlights from the AI Conf series presented by O'Reilly and Intel, plus related materials from Safari learning platform, Strata Data, Data Show, and the upcoming JupyterCon.
Along with reporting, we're leveraging AI in Media. This talk dives into O'Reilly uses of deep learning -- combined with ontology, graph algorithms, probabilistic data structures, and even some evolutionary software -- to help editors and customers alike accomplish more of what they need to do.
In particular, we'll show two open source projects in Python from O'Reilly's AI team:
• pytextrank built atop spaCy, NetworkX, datasketch, providing graph algorithms for advanced NLP and text analytics
• nbtransom leveraging Project Jupyter for a human-in-the-loop design pattern approach to AI work: people and machines collaborating on content annotation
Lessons learned from 3 (going on 4) generations of Jupyter use cases at O'Reilly Media. In particular, about "Oriole" tutorials which combine video with Jupyter notebooks, Docker containers, backed by services managed on a cluster by Marathon, Mesos, Redis, and Nginx.
https://conferences.oreilly.com/fluent/fl-ca/public/schedule/detail/62859
https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/62858
Strata UK 2017. Computable content leverages Jupyter notebooks to make learning materials more powerful by integrating compute engines, data sources, etc. O’Reilly Media extended this approach to create the new Oriole Online Tutorial medium, publishing notebooks from authors along with video timelines. (A free public tutorial, Regex Golf, by Peter Norvig demonstrates what’s possible with this technology integration.) Each user session launches a Docker container on a Mesos cluster for fully personalized compute environments. The UX is entirely browser based.
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates use of `TextBlob` and `SpaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
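To make the graph-algorithm part concrete, here is a tiny self-contained sketch of the ranking step at the heart of TextRank: a PageRank-style power iteration over an undirected word co-occurrence graph. This illustrates only the underlying algorithm, not the PyTextRank API, and the toy edge list is made up:

```python
from collections import defaultdict

def textrank(edges, damping=0.85, iters=50):
    """Rank nodes of an undirected co-occurrence graph by power iteration,
    following the TextRank scoring formula from Mihalcea & Tarau (2004)."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # score(n) = (1 - d) + d * sum over neighbors m of rank(m)/degree(m)
        rank = {
            n: (1 - damping) + damping * sum(
                rank[m] / len(graph[m]) for m in graph[n])
            for n in nodes
        }
    return sorted(rank, key=rank.get, reverse=True)

# Words that co-occur within a sliding window of some text (toy data)
edges = [("graph", "algorithm"), ("graph", "rank"),
         ("rank", "algorithm"), ("text", "rank")]
print(textrank(edges))  # "rank" comes first: it is the best-connected node
```

In the full package, candidate keyphrases are then assembled from top-ranked adjacent words, which is why the results read better than bag-of-words keyword extraction.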
Use of standards and related issues in predictive analytics - Paco Nathan
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
Presented 2015-08-24 at SF Bay ACM, held at the eBay south campus in San Jose.
http://meetup.com/SF-Bay-ACM/events/221693508/
Project Jupyter (https://jupyter.org/) evolved from IPython notebooks, and now supports a wide variety of programming language back-ends. Notebooks have proven to be effective tools in Data Science, providing convenient packages for what Don Knuth coined "literate programming" in the 1980s: code plus exposition in markdown. Results of running the code appear in-line as interactive graphics -- all packaged as collaborative, web-based documents. Some have said that the introduction of cloud-based notebooks is nearly as fundamental a change in software practice as the introduction of spreadsheets.
O'Reilly Media has been considering the question, "What comes after books and video?" Or, as one might imagine more pointedly, what comes after Kindle? To that point we have collaborated with Project Jupyter to integrate notebooks into our content management process, allowing authors to generate articles, tutorials, reports, and other media products as notebooks that also incorporate video segments. Code dependencies are containerized using Docker, and all of the content gets managed in Git repositories. We have added another layer, an open source project called Thebe, that provides a kind of "media player" for embedding the containerized notebooks into web pages.
GalvanizeU Seattle: Eleven Almost-Truisms About Data - Paco Nathan
http://www.meetup.com/Seattle-Data-Science/events/223445403/
Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study the technologies, the processes, and the people involved.
Microservices, containers, and machine learning - Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) get containerized and used to crawl and parse email archives. These produce JSON data sets; then we run machine learning on a Spark cluster to surface insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data — based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
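Of the three questions above, "who discusses most frequently with whom" is the simplest to picture: it reduces to counting how often pairs of participants share an email thread. A minimal pure-Python sketch with made-up data, far simpler than the Spark/GraphX pipeline described here:

```python
from collections import Counter
from itertools import combinations

def interaction_counts(threads):
    """Count how often each pair of people appears in the same email thread.

    Each thread is a list of participant names; pairs are normalized by
    sorting so ("alice", "bob") and ("bob", "alice") count as one pair.
    """
    pairs = Counter()
    for participants in threads:
        for a, b in combinations(sorted(set(participants)), 2):
            pairs[(a, b)] += 1
    return pairs

threads = [["alice", "bob"],
           ["alice", "bob", "carol"],
           ["bob", "carol"]]
print(interaction_counts(threads).most_common(3))
```

At mailing-list scale, these pair counts become weighted edges of the community graph, which is where the GraphX analytics take over.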
https://www.eventbrite.com/e/talk-by-paco-nathan-graph-analytics-in-spark-tickets-17173189472
Big Brains meetup hosted by BloomReach, 2015-06-04
Case study / demo of a large-scale graph analytics project, leveraging GraphX in Apache Spark to surface insights about open source developer communities — based on data mining of their email forums. The project works with any Apache email archive, applying NLP and machine learning techniques to analyze message threads, then constructs a large graph. Graph analytics, based on concise Scala coding examples in Spark, surface themes and interactions within the community. Results are used as feedback for respective developer communities, such as leaderboards, etc. As an example, we will examine analysis of the Spark developer community itself.
QCon São Paulo: Real-Time Analytics with Spark Streaming - Paco Nathan
"Real-Time Analytics with Spark Streaming" presented at QCon São Paulo, 2015-03-26
http://qconsp.com/presentation/real-time-analytics-spark-streaming
This talk presents an overview of Spark and its history and applications, then focuses on the Spark Streaming component used for real-time analytics. We compare it with earlier frameworks such as MillWheel and Storm, and explore industry motivations for open-source micro-batch streaming at scale.
The talk will include demos for streaming apps that include machine-learning examples. We also consider public case studies of production deployments at scale.
We’ll review the use of open-source sketch algorithms and probabilistic data structures that get leveraged in streaming – for example, trading 4% error bounds on real-time metrics for a two-orders-of-magnitude reduction in the required memory footprint of a Spark app.
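As an illustration of that trade-off, here is a minimal Count-Min Sketch in pure Python: one member of the family of probabilistic data structures mentioned above, which bounds count-estimate error in exchange for a small fixed memory footprint. The width/depth values here are arbitrary toy choices, not tuned production parameters:

```python
import hashlib

class CountMinSketch:
    """Fixed-memory approximate counter: estimates never undercount,
    and overcount only by a bounded amount with high probability."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hashed column per row, salted by the row index.
        for row in range(self.depth):
            h = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Minimum across rows limits the damage from hash collisions.
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMinSketch()
for word in ["spark"] * 100 + ["storm"] * 3:
    cms.add(word)
print(cms.estimate("spark"))  # at least 100, and very likely exactly 100
```

The whole structure is `width * depth` integers regardless of how many distinct items stream through, which is exactly the memory-for-accuracy bargain the talk describes.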
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More - Paco Nathan
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
State of ICS and IoT Cyber Threat Landscape Report 2024 preview - Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Neuro-symbolic is not enough, we need neuro-*semantic* - Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply doing machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These gains come only when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
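As a toy illustration of semantics as "predictable inference" (my example, not the speaker's): once a relation such as `subClassOf` is given a transitive semantics, the links entailed over a knowledge graph are fully determined, so a neuro-symbolic link predictor has a precise target to be evaluated against:

```python
def infer_transitive(triples, relation):
    """Materialize the transitive closure of one relation over a set of
    (subject, relation, object) triples. Because the semantics (transitivity)
    fixes exactly which links are entailed, the inference is predictable."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(facts):
            for (c, r2, d) in list(facts):
                if r1 == r2 == relation and b == c and (a, relation, d) not in facts:
                    facts.add((a, relation, d))
                    changed = True
    return facts

kg = {("cat", "subClassOf", "mammal"),
      ("mammal", "subClassOf", "animal")}
inferred = infer_transitive(kg, "subClassOf")
print(("cat", "subClassOf", "animal") in inferred)  # True
```

Machine learning on a graph whose relations carry no such semantics offers no comparable notion of which predicted links are entailed versus merely plausible.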
The Art of the Pitch: WordPress Relationships and Sales - Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers, without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... - BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
JMeter webinar - integration with InfluxDB and Grafana - RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Securing your Kubernetes cluster: a step-by-step guide to success! - KatiaHIMEUR1
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been easier to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
Building Enterprise Apps for Big Data with Cascading
1. Building Enterprise Apps
for Big Data with Cascading
Paco Nathan
Concurrent, Inc.
pnathan@concurrentinc.com
@pacoid
[workflow diagram: Word Count example — Document Collection → Tokenize → Scrub token → HashJoin Left with Stop Word List (RHS) → GroupBy token → Count → Word Count]
Copyright @2012, Concurrent, Inc.
2. Enterprise Apps
for Big Data
with Cascading
1. backstory: how we got here
2. build: Data Science teams
3. pattern: common use cases
4. intro: Cascading API
5. tutorial: for the impatient
6. code: sample apps
3. Intro to Cascading
[workflow diagram: Word Count example — Document Collection → Tokenize → Scrub token → HashJoin Left with Stop Word List (RHS) → GroupBy token → Count → Word Count]
1. backstory:
how we got here
4. inflection point
1997: huge Internet successes after the 1997 holiday season…
1998: AMZN, EBAY, Inktomi (YHOO Search), then GOOG
consider this metric:
annual revenue per customer / amount of data stored
2004: which dropped 100x within a few years after 1997
storage and processing costs plummeted, now we must
work much smarter to extract ROI from Big Data…
our methods must adapt
“conventional wisdom” of RDBMS and BI tools became
less viable; however, business cadre was still focused on
pivot tables and pie charts… which tends toward inertia!
MapReduce and the Hadoop open source stack grew
directly out of that contention… however, that effort
only solves parts of the puzzle
5. inflection point: consequences
Geoffrey Moore (Mohr Davidow Ventures, author of Crossing The Chasm)
Hadoop Summit, 2012:
“All of Fortune 500 is now on notice over the next 10-year period.”
Amazon and Google as exemplars of massive disruption in retail,
advertising, etc.
data as the major force displacing Global 1000 over the next decade,
mostly through apps — verticals, leveraging domain expertise
Michael Stonebraker (INGRES, PostgreSQL,Vertica,VoltDB, etc.)
XLDB, 2012:
“Complex analytics workloads are now displacing SQL as the basis
for Enterprise apps.”
6. primary sources
Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtube.com/watch?v=E91oEn1bnXM
Google
“The Birth of Google” – John Battelle
wired.com/wired/archive/13.08/battelle.html
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtube.com/watch?v=qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
12. data innovation: circa 2013
[architecture diagram: Customers, Data Apps, and team roles (Domain Expert — business process, workflow, dashboard metrics; Data Scientist — discovery + modeling, planner; App Dev — web apps, mobile, services, s/w dev; Eng/Ops — optimized capacity, social interactions, transactions, endpoints, content) layered over Data Access Patterns: Hadoop etc. (batch), Log Events, In-Memory Data Grid ("real time"), DW, RDBMS, Cluster Scheduler; legend: introduced capability vs. existing SDLC]
14. statistical thinking
Process Variation Data Tools
employing a mode of thought which includes both logical and analytical reasoning:
evaluating the whole of a problem, as well as its component parts; attempting
to assess the effects of changing one or more variables
this approach attempts to understand not just problems and solutions,
but also the processes involved and their variances
particularly valuable in Big Data work when combined with hands-on experience in
physics – roughly 50% of my peers come from physics or physical engineering…
programmers typically don’t think this way…
however, both systems engineers and data scientists must!
15. reference
by Leo Breiman
Statistical Modeling:
The Two Cultures
Statistical Science, 2001
bit.ly/eUTh9L
16. Intro to Cascading
[workflow diagram: Word Count example]
2. build:
Data Science teams
17. core values
Data Science teams develop actionable insights,
building confidence for decisions
that work may influence a few decisions worth
billions (e.g., M&A) or billions of small decisions
(e.g., AdWords)
probably somewhere in-between…
Wikipedia
solving for pattern, at scale.
an interdisciplinary pursuit which
requires teams, not sole players
18. most valuable skills
approximately 80% of the costs for data-related projects
get spent on data preparation – mostly on cleaning up
data quality issues: ETL, log file analysis, etc.
unfortunately, data-related budgets for many companies tend
to go into frameworks which can only be used after clean up
most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable
D3
the rest of the skills – modeling,
algorithms, etc. – those are secondary
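One of the bullets above — learn to estimate the confidence for reported results — can be sketched with a percentile bootstrap, which needs nothing beyond the JDK. This is a minimal illustration, not from the deck; the `meanCI` helper and the sample revenue figures are invented for the example.

```java
import java.util.Arrays;
import java.util.Random;

public class BootstrapCI {
    // Percentile bootstrap: resample the data B times with replacement,
    // compute the mean of each resample, then read off the 2.5th and
    // 97.5th percentiles of those means as a 95% confidence interval.
    public static double[] meanCI(double[] data, int b, long seed) {
        Random rng = new Random(seed);
        double[] means = new double[b];
        for (int i = 0; i < b; i++) {
            double sum = 0.0;
            for (int j = 0; j < data.length; j++)
                sum += data[rng.nextInt(data.length)];
            means[i] = sum / data.length;
        }
        Arrays.sort(means);
        return new double[] { means[(int) (0.025 * b)], means[(int) (0.975 * b)] };
    }

    public static void main(String[] args) {
        // hypothetical daily revenue samples
        double[] dailyRevenue = { 12.0, 15.5, 9.8, 14.2, 11.1, 13.7, 10.4, 16.0 };
        double[] ci = meanCI(dailyRevenue, 10_000, 42L);
        System.out.printf("95%% CI for the mean: [%.2f, %.2f]%n", ci[0], ci[1]);
    }
}
```

Resampling sidesteps distributional assumptions, which suits the messy data described above; for production work a library routine would be preferable.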
19. social caveats
“This data cannot be correct!” may be an early warning
about an organization itself
much depends on how the people whom you work alongside
tend to arrive at decisions:
‣ probably good: Induction, Abduction, Circumscription
‣ probably poor: Deduction, Speculation, Justification
in general, one good data visualization
puts many ongoing verbal arguments to rest
however, let domain experts handle
“data storytelling”, not data scientists
xkcd
20. the science in data science?
in a nutshell, what we do…
[background art: mirrored word cloud of game event-log entries, e.g. "NUI:DressUpMode", "Website Login", "Chat Now", "Buy an Item: web"]
‣ estimate probability
‣ calculate analytic variance
‣ manipulate order complexity
‣ make use of learning theory
+ collab with DevOps, Stakeholders
+ reduce our work to cron entries
21. synthesis of the above
MapReduce is Good Enough?
Jimmy Lin, U Maryland + Twitter
arxiv.org/pdf/1209.2191v1.pdf
A Few Useful Things to Know about Machine Learning
Pedro Domingos, U Washington
homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
22. team process = needs
discovery: help people ask the right questions
modeling: allow automation to place informed bets
integration: deliver products at scale to customers
apps: build smarts into product features
systems: keep infrastructure running, cost-effective
Gephi
23. team composition = roles
Domain Expert: business process, stakeholder
Data Scientist: data science — data prep, discovery, modeling, etc.
App Dev: software engineering, automation
Ops: systems engineering, access
[workflow diagram: Word Count example; legend: introduced capability]
24. matrix = needs × roles
[matrix: rows — stakeholder, scientist, developer, ops; columns — discovery, modeling, integration, apps, systems]
25. matrix: example team
[matrix: the same needs × roles grid, shaded to show one team's coverage]
summary: this team seems heavy on systems, may need more overlap
between modeling and integration, particularly among team leads
26. Q:
Can I simply hire one
rockstar data scientist
to cover all this work?
27. A: No, interdisciplinary
work requires teams.
A: Hire leads who speak
the lingo of each domain.
A: Hire people who cover
2+ roles, when possible.
28. reference
by DJ Patil
Data Jujitsu
O’Reilly, 2012
amazon.com/dp/B008HMN5BE
Building Data Science Teams
O’Reilly, 2011
amazon.com/dp/B005O4U3ZE
29. Intro to Cascading
[workflow diagram: Word Count example]
3. pattern:
common use cases
30. CAP theorem
purpose: theoretical limits for data access patterns
essence:
‣ consistency
‣ availability
‣ partition tolerance
best case scenario: you may pick two … or spend billions
struggling to obtain all three at scale (GOOG)
translated: cost of doing business
www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
julianbrowne.com/article/viewer/brewers-cap-theorem
31. data access patterns
design patterns: originated in consensus negotiation
for architecture, later used in software engineering
consider the corollaries in large-scale data work…
essence: select data frameworks based on
your data access patterns
in other words, decouple use cases based on needs
– avoid the “one size fits all” (OSFA) anti-pattern
let’s review some examples…
32. access → frameworks → forfeits
financial transactions general ledger in RDBMS CAx
ad-hoc queries RDS (hosted MySQL) CAx
reporting, dashboards like Pentaho CAx
log rotation/persistence like Riak xxP
search indexes like Lucene/Solr xAP
static content, archives S3 (durable storage) xAP
customer facts like Redis, Membase xAP
distributed counters, locks, sets like Redis x A P*
data objects CRUD key/value – like, NoSQL on MySQL CxP
authoritative metadata like Zookeeper CxP
data prep, modeling at scale like Hadoop/Cascading + R CxP
graph analysis like Hadoop + Redis + Gephi CxP
data marts like Hadoop/HBase CxP
37. use case: marketing funnel
• must optimize a very large ad spend
• different vendors report different metrics
• seasonal variation distorts performance
• some campaigns are much smaller than others
• hard to predict ROI for incremental spend
Wikipedia
approach:
• log aggregation, followed with cohort analysis
• bayesian point estimates compare different-sized ad tests
• customer lifetime value quantifies ROI of new leads
• time series analysis normalizes for seasonal variation
• geolocation adjusts for regional cost/benefit
• linear programming models estimate elasticity of demand
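The "bayesian point estimates" bullet can be made concrete with the simplest possible version: a Beta(1,1) prior over a conversion rate, whose posterior mean is (conversions + 1) / (impressions + 2). A hypothetical sketch — `posteriorMean` and the campaign numbers are illustrative, not from the talk:

```java
public class AdTestEstimate {
    // Posterior mean of a conversion rate under a uniform Beta(1,1) prior:
    // (conversions + 1) / (impressions + 2). Small samples get pulled
    // toward the prior mean of 0.5; large samples stay near the raw rate.
    public static double posteriorMean(long conversions, long impressions) {
        return (conversions + 1.0) / (impressions + 2.0);
    }

    public static void main(String[] args) {
        // a tiny ad test vs. a large one, on the same scale
        System.out.println(posteriorMean(3, 10));       // ≈ 0.333, shrunk from 0.30… toward 0.5
        System.out.println(posteriorMean(2100, 10000)); // ≈ 0.210, barely shrunk
    }
}
```

Shrinkage toward the prior is exactly what lets different-sized ad tests be compared on one scale without the small ones dominating by noise.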
38. use case: ecommerce fraud
• sparse data means lots of missing values
• “needle in a haystack” lack of training cases
• answers are available in large-scale batch, results are needed in real-time event processing
• not just one pattern to detect – many, ever-changing
stat.berkeley.edu
approach:
• random forest (RF) classifiers predict likely fraud
• subsampled data to re-balance training sets
• impute missing values based on density functions
• train on massive log files, run on in-memory grid
• adjust metrics to minimize customer support costs
• detect novelty – report anomalies via notifications
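The "subsampled data to re-balance training sets" step amounts to down-sampling the majority class so the classifier sees fraud and non-fraud in comparable numbers. A hedged sketch in plain Java; the `downsample` helper and the 9,900/100 split are invented for illustration:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class Rebalance {
    // Down-sample the majority class to a target size (typically the
    // minority-class size) so a classifier trains on a balanced set.
    public static <T> List<T> downsample(List<T> majority, int targetSize, long seed) {
        List<T> copy = new ArrayList<>(majority);
        Collections.shuffle(copy, new Random(seed));
        return new ArrayList<>(copy.subList(0, targetSize));
    }

    public static void main(String[] args) {
        List<Integer> legit = new ArrayList<>();
        for (int i = 0; i < 9900; i++) legit.add(i);  // 9,900 legitimate orders
        List<Integer> fraud = new ArrayList<>();
        for (int i = 0; i < 100; i++) fraud.add(i);   // 100 known fraud cases
        List<Integer> balancedLegit = downsample(legit, fraud.size(), 42L);
        System.out.println(balancedLegit.size() + " vs " + fraud.size()); // prints "100 vs 100"
    }
}
```

In practice the predicted probabilities then need recalibrating back to the true base rate, which ties into the "adjust metrics to minimize customer support costs" bullet.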
39. use case: customer segmentation
• many millions of customers, hard to determine which features resonate
• multi-modal distributions get obscured by the practice of calculating an “average”
• not much is known about individual customers
Mathworks
approach:
• connected components for sessionization, determining
uniques from logs
• estimates for age, gender, income, geo, etc.
• clustering algorithms to group into market segments
• social graph infers “unknown” relationships
• covariance/heat maps visualizes segments vs. feature sets
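The "connected components for sessionization" bullet can be illustrated with a union-find structure over log identifiers: any two ids observed in the same event are merged, and each resulting component is treated as one unique visitor. A sketch under that assumption — the `Sessionize` class and the cookie/device/login ids are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class Sessionize {
    // Union-find over string identifiers (cookies, device ids, logins).
    private final Map<String, String> parent = new HashMap<>();

    // Find the component root for an id, with path compression.
    public String find(String x) {
        parent.putIfAbsent(x, x);
        String root = parent.get(x);
        if (!root.equals(x)) {
            root = find(root);
            parent.put(x, root);
        }
        return root;
    }

    // Merge the components of two ids seen in the same log event.
    public void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    public static void main(String[] args) {
        Sessionize s = new Sessionize();
        s.union("cookie:42", "device:A");    // one event carried both ids
        s.union("device:A", "login:carol");  // a later event links the login
        s.union("cookie:77", "device:B");    // an unrelated visitor
        System.out.println(s.find("cookie:42").equals(s.find("login:carol"))); // prints "true"
        System.out.println(s.find("cookie:42").equals(s.find("cookie:77")));   // prints "false"
    }
}
```

At the scale described here the same merge runs as an iterative MapReduce job, but the invariant is identical: one component per unique visitor.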
40. use case: monetizing content
• need to suggest relevant content which would otherwise get buried in the back catalog
• big disconnect between inventory and limited performance ad market
• enormous amounts of text, hard to categorize
Digital Humanities
approach:
• text analytics glean key phrases from documents
• hierarchical clustering of char frequencies detects lang
• latent dirichlet allocation (LDA) reduces dimension to
topic models
• recommenders suggest similar topics to customers
• collaborative filters connect known users with less known
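The language-detection bullet relies on character frequencies being a strong language signal. As a simplified stand-in for the hierarchical clustering mentioned above, this sketch compares a text's letter-frequency profile against reference profiles by cosine similarity; `profile`, `cosine`, and the sample strings are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class LangDetect {
    // Letter-frequency profile of a text sample, normalized to sum to 1.
    public static Map<Character, Double> profile(String text) {
        Map<Character, Double> freq = new HashMap<>();
        int n = 0;
        for (char c : text.toLowerCase().toCharArray()) {
            if (Character.isLetter(c)) {
                freq.merge(c, 1.0, Double::sum);
                n++;
            }
        }
        final int total = n;
        freq.replaceAll((c, v) -> v / total);
        return freq;
    }

    // Cosine similarity between two frequency profiles.
    public static double cosine(Map<Character, Double> a, Map<Character, Double> b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (Map.Entry<Character, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        String en = "the quick brown fox jumps over the lazy dog and then walks away";
        String es = "el zorro salta sobre el perro perezoso y luego camina hacia la casa";
        String unknown = "the dog walks over the brown fox";
        double toEn = cosine(profile(unknown), profile(en));
        double toEs = cosine(profile(unknown), profile(es));
        System.out.println(toEn > toEs ? "closer to English profile" : "closer to Spanish profile");
    }
}
```

With real corpora the reference profiles would be built per language from large text samples, and clustering groups documents whose profiles sit close together.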
41. Intro to Cascading
[workflow diagram: Word Count example]
4. intro:
Cascading API
42. Cascading API: purpose
‣ simplify data processing development and deployment
‣ improve application developer productivity
‣ enable data processing application manageability
43. Cascading API: a few facts
Java open source project (ASL 2) using Git, Gradle, Maven, JUnit, etc.
in production (~5 yrs) at hundreds of enterprise Hadoop deployments:
Finance, Health Care, Transportation, other verticals
studies published about large use cases: Twitter, Etsy, Airbnb, Square,
Climate Corporation, FlightCaster, Williams-Sonoma
partnerships and distribution with SpringSource, Amazon AWS,
Microsoft Azure, Hortonworks, MapR, EMC
several open source projects built atop, managed by Twitter, Etsy, etc.,
which provide substantial Machine Learning libraries
DSLs available in Scala, Clojure, Python (Jython), Ruby (JRuby), Groovy
data “taps” integrate popular data frameworks via JDBC, Memcached, HBase,
plus serialization in Apache Thrift, Avro, Kryo, etc.
entire app compiles into a single JAR: fully connected for compiler optimization,
exception handling, debugging, config, scheduling, etc.
44. Cascading API: a few quotes
“Cascading gives Java developers the ability to build Big Data applications
on Hadoop using their existing skillset … Management can really go out
and build a team around folks that are already very experienced with Java.
Switching over to this is really a very short exercise.”
CIO, Thor Olavsrud, 2012-06-06
cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading
“Masks the complexity of MapReduce, simplifies the programming, and
speeds you on your journey toward actionable analytics … A vast
improvement over native MapReduce functions or Pig UDFs.”
2012 BOSSIE Awards, James Borck, 2012-09-18
infoworld.com/slideshow/65089
“Company’s promise to application developers is an opportunity to build
and test applications on their desktops in the language of choice with
familiar constructs and reusable components”
Dr. Dobb’s, Adrian Bridgwater, 2012-06-08
drdobbs.com/jvm/where-does-big-data-go-to-get-data-inten/240001759
45. data+code “political spectrum”
“Notes from the Mystery Machine Bus”
by Steve Yegge, Google
goo.gl/SeRZa
“conservative” “liberal”
(mostly) Enterprise (mostly) Start-Up
risk management customer experiments
assurance flexibility
well-defined schema schema follows code
explicit configuration convention
type-checking compiler interpreted scripts
wants no surprises wants no impediments
Java, Scala, Clojure, etc. PHP, Ruby, Python, etc.
Cascading, Scalding, Cascalog, etc. Hive, Pig, Hadoop Streaming, etc.
46. Cascading API: adoption
As Enterprise apps move into
Hadoop and related BigData
frameworks, risk profiles shift
toward more conservative
programming practices
Cascading provides a popular
API for defining and managing
Enterprise data workflows
47. enterprise data workflows
Tuples, Pipelines, Endpoints, Operations, Joins, Assertions, Traps, etc.
…in other words, “plumbing”
[workflow diagram: Word Count example]
48. data workflows: team
‣ Business Stakeholder POV:
business process management for workflow orchestration (think BPM/BPEL)
‣ Systems Integrator POV:
system integration of heterogenous data sources and compute platforms
‣ Data Scientist POV:
a directed, acyclic graph (DAG) on which we can apply Amdahl's Law, etc.
‣ Data Architect POV:
a physical plan for large-scale data flow management
‣ Software Architect POV:
a pattern language, similar to plumbing or circuit design
‣ App Developer POV:
API bindings for Java, Scala, Clojure, Jython, JRuby, etc.
‣ Systems Engineer POV:
a JAR file, has passed CI, available in a Maven repo
[workflow diagram: Word Count example]
49. data workflows: layers
business process: domain expertise, business trade-offs, operating parameters, market position, etc.
API language: Java, Scala, Clojure, Jython, JRuby, Groovy, etc. …envision whatever runs in a JVM
optimize / schedule: major changes in technology now
physical plan: [workflow diagram: Word Count example]
compute substrate (“assembler” code): Apache Hadoop, in-memory local mode …envision GPUs, streaming, etc.
machine data: Splunk, Nagios, Collectd, New Relic, etc.
50. data workflows: SQL
[diagram: the relational stack — SQL parser → logical plan, optimized based on stats → physical plan → b-trees, etc.; informed by query history and table stats; schema side: ERD → table schema → catalog]
51. data workflows: SQL vs. JVM
Relational → Cascading + Driven:
SQL parser → SQL-92 compliant parser (in progress)
logical plan, optimized based on stats → TODO: logical plan, optimized based on stats
physical plan → API “plumbing”
query history, table stats → app history, tuple stats
b-trees, etc. → distributed compute substrate: Hadoop, in-memory, etc.
ERD → flow diagram
table schema → tuple schema
catalog → endpoint usage DB
52. Intro to Cascading
[workflow diagram: Word Count example]
5. tutorial:
for the impatient
53. “Cascading for the Impatient”
cascading.org/category/impatient/
‣ a series of introductory tutorials and code samples
‣ 1:1 code comparisons in Scalding, Cascalog, Pig, Hive
[workflow diagram: Word Count example]
54. 1: copy
public class Main
{
  public static void main( String[] args )
  {
    String inPath = args[ 0 ];
    String outPath = args[ 1 ];

    Properties props = new Properties();
    AppProps.setApplicationJarClass( props, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

    // create the source tap
    Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

    // create the sink tap
    Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

    // specify a pipe to connect the taps
    Pipe copyPipe = new Pipe( "copy" );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
      .addSource( copyPipe, inTap )
      .addTailSink( copyPipe, outTap );

    // run the flow
    flowConnector.connect( flowDef ).complete();
  }
}
1 mapper, 0 reducers, 10 lines code
55. wait!
ten lines of code
for a file copy…
seems like a lot.
56. same JAR, any scale…
MegaCorp Enterprise IT:
Pb’s data
1000+ node private cluster
EVP calls you when app fails
runtime: days+
Production Cluster:
Tb’s data
EMR w/ 50 HPC Instances
Ops monitors results
runtime: hours – days
Staging Cluster:
Gb’s data
EMR + 4 Spot Instances
CI shows red or green lights
runtime: minutes – hours
Your Laptop:
Mb’s data
Hadoop standalone mode
passes unit tests, or not
runtime: seconds – minutes
57. 2: word count
[workflow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]
1 mapper
1 reducer
18 lines code gist.github.com/3900702
58. Cascading / Java
[workflow diagram: Word Count]
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
.addSource( docPipe, docTap )
.addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
59. Scalding / Scala
[workflow diagram: Word Count]
// Sujit Pal
// sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
package com.mycompany.impatient
import com.twitter.scalding._
class Part2(args : Args) extends Job(args) {
val input = Tsv(args("input"), ('docId, 'text))
val output = Tsv(args("output"))
input.read.
flatMap('text -> 'word) {
text : String => text.split("""\s+""")
}.
groupBy('word) { group => group.size }.
write(output)
}
60. Cascalog / Clojure
[workflow diagram: Word Count]
; Paul Lam
; github.com/Quantisan/Impatient
(ns impatient.core
(:use [cascalog.api]
[cascalog.more-taps :only (hfs-delimited)])
(:require [clojure.string :as s]
[cascalog.ops :as c])
(:gen-class))
(defmapcatop split [line]
"reads in a line of string and splits it by regex"
(s/split line #"[\[\](),.)\s]+"))
(defn -main [in out & args]
(?<- (hfs-delimited out)
[?word ?count]
((hfs-delimited in :skip-header? true) _ ?line)
(split ?line :> ?word)
(c/count ?count)))
61. Hive
[workflow diagram: Word Count]
-- Steve Severance
-- stackoverflow.com/questions/10039949/word-count-program-in-hive
CREATE TABLE input (line STRING);
LOAD DATA LOCAL INPATH 'input.tsv'
OVERWRITE INTO TABLE input;
SELECT
word, COUNT(*)
FROM input
LATERAL VIEW explode(split(line, ' ')) lTable AS word
GROUP BY word
;
62. Pig
[workflow diagram: Word Count]
-- kudos to Dmitriy Ryaboy
docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource')
AS (doc_id, text);
docPipe = FILTER docPipe BY doc_id != 'doc_id';
-- specify regex to split "document" text lines into token stream
tokenPipe = FOREACH docPipe
GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';
-- determine the word counts
tokenGroups = GROUP tokenPipe BY token;
wcPipe = FOREACH tokenGroups
GENERATE group AS token, COUNT(tokenPipe) AS count;
-- output
STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
EXPLAIN -out dot/wc_pig.dot -dot wcPipe;
63. 3: wc + scrub
[workflow diagram: Document Collection → Tokenize → Scrub token → GroupBy token → Count → Word Count]
1 mapper
1 reducer
22+10 lines code
64. 4: wc + scrub + stop words
[workflow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left with Stop Word List (RHS) → GroupBy token → Count → Word Count]
1 mapper
1 reducer
28+10 lines code
65. 5: tf-idf
[workflow diagram: the word count flow extended with branches — Unique doc_id → Insert 1 → SumBy doc_id (D); Unique token → GroupBy token → Count (DF); GroupBy doc_id, token → Count (TF); ExprFunc + CoGroup combining D, DF, TF into tf-idf (TF-IDF), alongside Word Count]
11 mappers
9 reducers
65+10 lines code
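Outside of Cascading, the arithmetic this flow performs is small. A plain-Java sketch, assuming tf is a raw in-document count and idf = ln(N / df); the `tfIdf` helper and the toy documents are illustrative, not the tutorial's code:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdf {
    // tf-idf per token for one document: tf is the raw count of the token
    // in that document; idf = ln(N / df), where df is the number of
    // documents containing the token and N is the corpus size.
    public static Map<String, Double> tfIdf(List<List<String>> docs, int docIndex) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs)
            for (String t : new HashSet<>(doc))  // count each doc once per token
                df.merge(t, 1, Integer::sum);
        Map<String, Double> scores = new HashMap<>();
        for (String t : docs.get(docIndex)) {
            long tf = docs.get(docIndex).stream().filter(t::equals).count();
            scores.put(t, tf * Math.log((double) docs.size() / df.get(t)));
        }
        return scores;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("rain", "shadow", "dry", "lee"),
            List.of("rain", "shadow", "dry", "air", "lee"),
            List.of("rain", "shadow", "land", "leeward"));
        System.out.println(tfIdf(docs, 0).get("rain")); // prints "0.0"
        System.out.println(tfIdf(docs, 0).get("lee"));  // ln(3/2) ≈ 0.4055
    }
}
```

A token appearing in every document gets idf = ln(1) = 0, which is why ubiquitous words like "rain" and "shadow" score 0.0000 in the results slide.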
66. 6: tf-idf + tdd
[workflow diagram: the tf-idf flow with Assert, Checkpoint, and Failure Traps added]
12 mappers
9 reducers
76+14 lines code
68. results?
doc_id	text
doc01	A rain shadow is a dry area on the lee back side of a mountainous area.
doc02	This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03	A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04	This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05	Two Women. Secrets. A Broken Land. [DVD Australia]
zoink	null
doc_id	tf-idf	token
doc02	0.9163	air
doc05	0.9163	australia
doc05	0.9163	broken
doc04	0.9163	california's
doc04	0.9163	cause
doc02	0.9163	cloudcover
doc04	0.9163	death
doc04	0.9163	deserts
doc03	0.9163	downwind
doc02	0.9163	sinking
doc04	0.9163	such
doc04	0.9163	valley
doc05	0.9163	women
doc03	0.5108	land
doc05	0.5108	land
doc01	0.5108	lee
doc02	0.5108	lee
doc03	0.5108	leeward
doc04	0.5108	leeward
doc01	0.4463	area
doc02	0.2231	area
doc03	0.2231	area
doc01	0.2231	dry
doc02	0.2231	dry
doc03	0.2231	dry
doc02	0.2231	mountain
doc03	0.2231	mountain
doc04	0.2231	mountain
doc01	0.0000	rain
doc02	0.0000	rain
doc03	0.0000	rain
doc04	0.0000	rain
doc01	0.0000	shadow
doc02	0.0000	shadow
doc03	0.0000	shadow
doc04	0.0000	shadow
[workflow diagram: the tf-idf + tdd flow, shown alongside the results]
69. comparisons?
compare similar code in Scalding (Scala) and Cascalog (Clojure):
sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
based on: github.com/twitter/scalding/wiki
github.com/Quantisan/Impatient
based on: github.com/nathanmarz/cascalog/wiki
70. Intro to Cascading
[workflow diagram: Word Count example]
6. code:
sample apps
71. Social Recommender
[architecture diagram: Twitter tweets → filter stop words → calculate similarity → threshold min, max → sinks: Neo4j, LDA, Redis; with a QA tap]
github.com/Cascading/SampleRecommender
‣ social recommender based on Twitter: suggest users who tweet about similar stocks
‣ instead of a cross-product (potential bottleneck) this runs in parallel on Hadoop
‣ uses a stop word list to remove common words, offensive phrases, etc.
‣ one tap measures token frequency: for QA, adjust stop words, improve filter, etc.
‣ adapted in Spring by Costin Leau
72. SocRec: architecture
[architecture diagram: Twitter firehose source → filter low-freq stop words → checkpoint: tokenized tweets (uid, tweet, t) → calculate similarity, with a QA checkpoint on token frequency for analysis + curation → checkpoint: similar users → threshold min, max (similarity thresholds) → sinks: Neo4j social graph, LDA topic results, Redis trending (uid: uidx, rank); batch updates feed back into the source]
74. City of Palo Alto open data
[workflow diagram: CoPA GIS export → regex parsers/filters for tree, road, and park records → HashJoins with curated tree species and road metadata → Estimate Albedo, Road Segments, Tree Distance, Geohash → CoGroup with GPS logs → shade recommendation; Checkpoints and Failure Traps throughout]
github.com/Cascading/CoPA/wiki
‣ GIS export for parks, roads, trees (unstructured / open data)
‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
‣ curated metadata, used to enrich the dataset
‣ could extend via mash-up with many available public data APIs
Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
“Find a shady spot on a summer day to walk near downtown and take a call…”