Silicon Valley Cloud Computing Meetup
Mountain View, 2010-07-19
Examples of Hadoop Streaming, based on Python scripts running on the AWS Elastic MapReduce service, which show text mining on the "Enron Email Dataset" from Infochimps.com plus data visualization using R and Gephi
Source at: http://github.com/ceteri/ceteri-mapred
Introduction to Data Processing Using Hadoop and Pig, by Ricardo Varela
This talk introduces data processing with big data and reviews the basic concepts of MapReduce programming with Hadoop. We also discuss the use of Pig to simplify the development of data processing applications.
YDN Tuesdays are geek meetups organized on the first Tuesday of each month by YDN in London.
Hadoop, Pig, and Twitter (NoSQL East 2009), by Kevin Weil
A talk on the use of Hadoop and Pig inside Twitter, focusing on the flexibility and simplicity of Pig, and the benefits of that for solving real-world big data problems.
So you want to get started with Hadoop, but how? This session will show you how to get started with Hadoop development using Pig. Prior Hadoop experience is not needed.
Thursday, May 8th, 02:00pm-02:50pm
Rainbird: Realtime Analytics at Twitter (Strata 2011), by Kevin Weil
Introducing Rainbird, Twitter's high volume distributed counting service for realtime analytics, built on Cassandra. This presentation looks at the motivation, design, and uses of Rainbird across Twitter.
Introduction to Hadoop.
What are Hadoop, MapReduce, and the Hadoop Distributed File System?
Who uses Hadoop?
How to run Hadoop?
What are Pig, Hive, Mahout?
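The MapReduce model behind Hadoop is easiest to see in the classic word-count example. Below is a minimal pure-Python sketch in the style of a Hadoop Streaming mapper and reducer; the input lines and the local shuffle simulation are illustrative, and on a real cluster the sort/shuffle happens between separate mapper and reducer scripts reading stdin.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit one (word, 1) pair per token."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word.

    Hadoop delivers mapper output to the reducer sorted by key,
    which is the property groupby relies on here.
    """
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Local simulation of the shuffle: sort mapper output by key, then
# reduce. On a cluster this sort happens between the two scripts.
lines = ["the quick brown fox", "the lazy dog"]
counts = dict(reducer(sorted(mapper(lines))))
print(counts)  # 'the' maps to 2, every other word to 1
```

With Hadoop Streaming, the same two functions would live in separate scripts that read stdin and write tab-separated key/value lines.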
Interview Questions on Apache Spark [Part 2], by knowbigdata
This is Apache Spark Question & Answer Tutorial.
We provide training on Big Data & Hadoop, Hadoop Admin, MongoDB, Data Analytics with R, Python, etc.
Our Big Data & Hadoop course consists of an introduction to Hadoop and Big Data, HDFS architecture, MapReduce, YARN, Pig Latin, Hive, HBase, Mahout, ZooKeeper, Oozie, Flume, Spark, and NoSQL, with quizzes and assignments.
To watch the video or know more about the course, please visit http://www.knowbigdata.com/page/big-data-spark
Optimal Execution of MapReduce Jobs in Cloud (Voices 2015), by Deanna Kosaraju
Optimal Execution Of MapReduce Jobs In Cloud
Anshul Aggarwal, Software Engineer, Cisco Systems
Session Length: 1 Hour
Tue March 10 21:30 PST
Wed March 11 0:30 EST
Wed March 11 4:30 UTC
Wed March 11 10:00 IST
Wed March 11 15:30 Sydney
Voices 2015 www.globaltechwomen.com
We use the MapReduce programming paradigm because it lends itself well to most data-intensive analytics jobs run in the cloud these days, given its ability to scale out and leverage several machines to process data in parallel. Research has demonstrated that existing approaches to provisioning other applications in the cloud are not immediately applicable to MapReduce-based applications. Provisioning a MapReduce job entails requesting an optimal number of resource sets (RS) and configuring MapReduce parameters such that each resource set is maximally utilized.
Each application has a different bottleneck resource (CPU, disk, or network) and a different bottleneck resource utilization, and thus needs a different combination of these parameters, chosen from the job profile, such that the bottleneck resource is maximally utilized.
The problem at hand is thus defining a resource provisioning framework for MapReduce jobs running in a cloud, keeping in mind performance goals such as optimal resource utilization with minimum incurred cost, lower execution time, energy awareness, automatic handling of node failures, and a highly scalable solution.
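As a rough illustration of the sizing idea in the abstract above, here is a toy sketch that picks the number of resource sets from the bottleneck resource. The resource names, demand figures, and capacities are invented for illustration; they are not part of the talk's actual framework.

```python
import math

# Illustrative per-resource demand of one job and per-RS capacity;
# a real framework would measure these from the job profile.
job_demand = {"cpu": 400.0, "disk": 1200.0, "network": 300.0}
rs_capacity = {"cpu": 32.0, "disk": 80.0, "network": 40.0}

def provision(job_demand, rs_capacity):
    """Size the cluster to saturate the bottleneck resource.

    The bottleneck is the resource with the highest demand-to-capacity
    ratio; requesting ceil(ratio) resource sets means that resource is
    maximally utilized while the others have headroom.
    """
    ratios = {r: job_demand[r] / rs_capacity[r] for r in job_demand}
    bottleneck = max(ratios, key=ratios.get)
    return bottleneck, math.ceil(ratios[bottleneck])

print(provision(job_demand, rs_capacity))  # ('disk', 15): a disk-bound job
```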
This presentation attempts to make the concepts of the Carver model of "Policy Governance" (a registered trademark) accessible to small nonprofits and their boards.
BFIT for Healthcare is a document management system designed for the healthcare industry. BFIT provides your hospital or clinic with a comprehensive database repository to keep your patient database, medical records, X-ray negatives, letters of reference, and other important notes in one system. The system also provides a notepad and image editor so doctors can record their observations and solutions, much like writing on a patient's card.
It is a system used to manage, track, and store documents and reduce paper. It enables organizations to manage tasks effectively and streamline the processing of their documents across all departments.
A Practical Guide to Capturing, Organizing, and Securing Your Documents, by Scott Abel
Presented by Jeff Potts at Documentation and Training Life Sciences, June 23-26, 2008 in Indianapolis.
Every organization struggles with how to store, tag, and search for their documents. In a hospital corporation, the need is particularly critical. Hospital staff need to be able to quickly find the latest policies and procedures. Auditors need to be able to track who made what changes and when. Lawyers want to know which protocols were in place on a particular date. In this session you’ll learn a practical approach to putting a document management system in place that can help address these needs and reduce your exposure to legal, regulatory, and even human health risks.
Based on lessons learned during a real-world project, the session shows that getting your documents under control doesn’t have to be a multi-year, multi-million dollar effort. The slide deck outlines how a hospital corporation in New England used a “start small and grow” approach to piloting and rolling out a document management solution across the corporation.
Introduction to MapReduce Data Transformations, by swooledge
MapReduce is a framework for scalable parallel data processing popularized by Google. Although initially used for simple large-scale text processing, map/reduce has recently been expanded to serve some application tasks normally performed by traditional relational databases.
You Will Learn
* The basics of Map/Reduce programming in Java
* The application domains where the framework is most appropriate
* How to build analytic database systems that handle large datasets and multiple data sources robustly
* How to evaluate data warehousing vendors in a realistic and unbiased way
* Emerging trends to combine Map/Reduce with standard SQL for improved power and efficiency
Geared To
* Programmers
* Developers
* Database Administrators
* Data warehouse managers
* CIOs
* CTOs
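One of the trends listed above, combining Map/Reduce with standard SQL, can be shown in miniature. The following pure-Python sketch expresses a GROUP BY aggregation as map, shuffle, and reduce steps; the rows are invented, and systems such as Hive compile SQL to MapReduce jobs in a conceptually similar (though far more sophisticated) way.

```python
from itertools import groupby
from operator import itemgetter

# Invented rows of (department, salary)
rows = [("eng", 100), ("eng", 120), ("sales", 90), ("sales", 70), ("hr", 60)]

# Equivalent of: SELECT department, SUM(salary) FROM rows GROUP BY department
mapped = [(dept, salary) for dept, salary in rows]   # map: emit key/value pairs
shuffled = sorted(mapped, key=itemgetter(0))         # shuffle: bring keys together
totals = {dept: sum(v for _, v in grp)               # reduce: aggregate per key
          for dept, grp in groupby(shuffled, key=itemgetter(0))}

print(totals)  # {'eng': 220, 'hr': 60, 'sales': 160}
```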
The Chief Data Officer Agenda: Metrics for Information and Data Management, by DATAVERSITY
Welcome to The Chief Data Officer Agenda, a DATAVERSITY monthly webinar focused on the emerging priorities of the Chief Data Officer (CDO). What issues are CDOs facing now, and what should be on their agenda? The webinar series is moderated by DATAVERSITY CEO and Founder Tony Shaw, who is joined each month by guest experts to discuss the requirements and demands of the burgeoning CDO role.
This month in the series:
The value proposition of enterprise information management is founded on Information being treated as an Asset. Information management professionals concur, but CxOs will say "So what?" In most organizations, they are both right! The conflict starts with one group thinking metaphorically, and the other literally. CDOs know that “Information asset” needs to be more than a metaphor…it has to be actionable. When you’re in charge of the application and value of data, how do you measure that? How do you measure progress? What types of metrics are there and which ones actually work? There is a lot more to measuring the value of information than common ROI.
This presentation will give you some starting points for real information asset management and information economics. You’ll learn some of the techniques being used successfully today, and considerations for quantifying the value and progress of information management. There is a means of reconciliation between the metaphors and reality, and this talk will outline a vision for the future, but with practical steps to help you get there.
In this session, we'll discuss architectural, design and tuning best practices for building rock solid and scalable Alfresco Solutions. We'll cover the typical use cases for highly scalable Alfresco solutions, like massive injection and high concurrency, also introducing 3.3 and 3.4 Transfer / Replication services for building complex high availability enterprise architectures.
Slide deck from an Alfresco Webinar which can be viewed at http://blogs.alfresco.com/wp/webcasts/2009/05/alfresco-webcast-a-developers-guide-1-capabilities-architecture-optaros/
This presentation discusses what Alfresco is and options for working with it from a developer perspective.
Alfresco 5.2 Introduces New Public REST APIs
For an update, please see: https://www.slideshare.net/jvonka/exciting-new-alfresco-apis
https://www.meetup.com/Alfresco-Meetups/events/236987848/
An overview of the new and enhanced APIs will be discussed and some of the key endpoints demonstrated via Postman so that by the time you leave you should have enough knowledge to create a simple client or integration.
These APIs will also be the foundation for new clients developed for the Alfresco Digital Business Platform.
We'll have a sneak peek at what's coming next and leave plenty of time for questions, feedback and open discussion.
Hadoop is commonly used for processing large swaths of data in batch. While many of the necessary building blocks for data processing exist within the Hadoop ecosystem – HDFS, MapReduce, HBase, Hive, Pig, Oozie, and so on – it can be a challenge to assemble and operationalize them as a production ETL platform. This presentation covers one approach to data ingest, organization, format selection, process orchestration, and external system integration, based on collective experience acquired across many production Hadoop deployments.
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S..., by BigDataEverywhere
Paco Nathan, Director of Community Evangelism at Databricks
Apache Spark is intended as a fast and powerful general purpose engine for processing Hadoop data. Spark supports combinations of batch processing, streaming, SQL, ML, Graph, etc., for applications written in Scala, Java, Python, Clojure, and R, among others. In this talk, I'll explore how Spark fits into the Big Data landscape. In addition, I'll describe other systems with which Spark pairs nicely, and will also explain why Spark is needed for the work ahead.
The Fundamentals Guide to HDP and HDInsight, by Gert Drapers
This session will give you an architectural overview and an introduction to the inner workings of HDP 2.0 (http://hortonworks.com/products/hdp-windows/) and HDInsight. The world has embraced the Hadoop toolkit to solve data problems from ETL and data warehouses to event processing pipelines. As Hadoop consists of many components, services, and interfaces, understanding its architecture is crucial before you can successfully integrate it into your own environment.
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters, by Kumari Surabhi
It presents a performance analysis of OpenStack Cloud versus commodity computers in big data environments. It concludes that data storage and analysis in a Hadoop cluster in the cloud are more flexible and more easily scalable than on a real-system cluster, but also that clusters of commodity computers are faster than cloud clusters.
This presentation will give you information about HDFS Overview and Architecture:
1. Configuring HDFS
2. Interacting With HDFS
3. HDFS Permissions and Security
4. Additional HDFS Tasks
5. HDFS Installation
6. Hadoop File System Shell
7. File System Java API
How do you create an enterprise data lake for enterprise-wide information storage and sharing? This covers the data lake concept, architecture principles, support for data science, and a review of some use cases.
Introduction to Big Data and how FIWARE manages it through different approaches, including the differences between the Apache Flink and Spark approaches. Introduction to the FIWARE connectors for managing NGSI context information, and a brief introduction to machine learning with FIWARE technology.
Unified Big Data Processing with Apache Spark (QCON 2014), by Databricks
While early big data systems, such as MapReduce, focused on batch processing, the demands on these systems have quickly grown. Users quickly needed to run (1) more interactive ad-hoc queries, (2) sophisticated multi-pass algorithms (e.g. machine learning), and (3) real-time stream processing. The result has been an explosion of specialized systems to tackle these new workloads. Unfortunately, this means more systems to learn, manage, and stitch together into pipelines. Spark is unique in taking a step back and trying to provide a *unified* post-MapReduce programming model that tackles all these workloads. By generalizing MapReduce to support fast data sharing and low-latency jobs, we achieve best-in-class performance in a variety of workloads, while providing a simple programming model that lets users easily and efficiently combine them.
Today, Spark is the most active open source project in big data, with high activity in both the core engine and a growing array of standard libraries built on top (e.g. machine learning, stream processing, SQL). I'm going to talk about the latest developments in Spark and show examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code.
Talk by Databricks CTO and Apache Spark creator Matei Zaharia at QCON San Francisco 2014.
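The "unified, post-MapReduce programming model" and fast data sharing described above rest on two ideas: lazy transformations and caching of intermediate results. This toy class sketches both in plain Python; it is not Spark's actual API, and all names here are invented for illustration.

```python
class MiniRDD:
    """Toy lazy dataset: records transformations, runs them only when an
    action is called, and caches the result for reuse. These are the core
    ideas (lazy evaluation, fast data sharing) behind Spark's model.
    """
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []
        self._cache = None

    def map(self, f):                  # transformation: recorded, not run
        return MiniRDD(self._data, self._ops + [("map", f)])

    def filter(self, f):               # transformation: recorded, not run
        return MiniRDD(self._data, self._ops + [("filter", f)])

    def collect(self):                 # action: runs the pipeline once
        if self._cache is None:
            out = list(self._data)
            for kind, f in self._ops:
                out = ([f(x) for x in out] if kind == "map"
                       else [x for x in out if f(x)])
            self._cache = out          # later actions reuse this result
        return self._cache

squares = MiniRDD(range(10)).map(lambda x: x * x)
print(squares.filter(lambda x: x % 2 == 0).collect())  # [0, 4, 16, 36, 64]
```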
Shared by Mansoor Mirza
Distributed Computing
What is it?
Why & when we need it?
Comparison with centralized computing
‘MapReduce’ (MR) Framework
Theory and practice
‘MapReduce’ in Action
Using Hadoop
Lab exercises
How Apache Spark fits into the Big Data landscape, by Paco Nathan
How Apache Spark fits into the Big Data landscape http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/217858832/
2014-12-02 in Herndon, VA and sponsored by Raytheon, Tetra Concepts, and MetiStream
Human in the loop: a design pattern for managing teams working with ML, by Paco Nathan
Strata CA 2018-03-08
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223
Although it has long been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but exceptions are referred to human experts, whose decisions help improve new iterations of the models.
Human-in-the-loop: a design pattern for managing teams that leverage ML, by Paco Nathan
Strata Singapore 2017 session talk 2017-12-06
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/65611
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called active learning allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We’ll consider some of the technical aspects — including available open source projects — as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn’t it applicable?
* How do HITL approaches compare/contrast with more “typical” use of Big Data?
* What’s the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time:
* In what ways do the humans involved learn from the machines?
In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
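The active-learning loop described above can be sketched in a few lines of Python. The confidence threshold, toy classifier, and stub expert below are invented for illustration only; a production HITL pipeline would use a trained model's predicted-class probabilities and a real review queue.

```python
THRESHOLD = 0.8  # invented cut-off; tuned per pipeline in practice

def toy_model(item):
    """Stand-in for a trained classifier returning (label, confidence)."""
    if "offer" in item:
        return "spam", 0.95
    if "meeting" in item:
        return "ham", 0.95
    return "ham", 0.55   # an edge case the model is unsure about

def human_expert(item):
    """Stub for the review queue; in a real pipeline the judgement is
    also fed back into the training set for the next model iteration."""
    return "ham"

def label_stream(items, model=toy_model, expert=human_expert):
    """Mostly-automated labeling: machines handle confident cases,
    exceptions get referred to human experts."""
    labels, referred = [], []
    for item in items:
        label, conf = model(item)
        if conf >= THRESHOLD:
            labels.append((item, label, "machine"))
        else:
            referred.append(item)
            labels.append((item, expert(item), "human"))
    return labels, referred

labels, referred = label_stream(["limited offer now", "meeting at 3", "???"])
print(referred)  # ['???'] -- only the uncertain item reaches a human
```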
Human-in-a-loop: a design pattern for managing teams which leverage ML, by Paco Nathan
Human-in-a-loop: a design pattern for managing teams which leverage ML
Big Data Spain, 2017-11-16
https://www.bigdataspain.org/2017/talk/human-in-the-loop-a-design-pattern-for-managing-teams-which-leverage-ml
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We'll consider some of the technical aspects -- including available open source projects -- as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn't it applicable?
* How do HITL approaches compare/contrast with more "typical" use of Big Data?
* What's the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?
In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Humans in a loop: Jupyter notebooks as a front-end for AI, by Paco Nathan
JupyterCon NY 2017-08-24
https://www.safaribooksonline.com/library/view/jupytercon-2017-/9781491985311/video313210.html
Paco Nathan reviews use cases where Jupyter provides a front-end to AI as the means for keeping "humans in the loop". This talk introduces *active learning* and the "human-in-the-loop" design pattern for managing how people and machines collaborate in AI workflows, including several case studies.
The talk also explores how O'Reilly Media leverages AI in Media, and in particular some of our use cases for active learning, such as disambiguation in content discovery. We're using Jupyter as a way to manage active learning ML pipelines, where the machines generally run automated until they hit an edge case and refer the judgement back to human experts. In turn, the experts train the ML pipelines purely through examples, not feature engineering, model parameters, etc.
Jupyter notebooks serve as one part configuration file, one part data sample, one part structured log, one part data visualization tool. O'Reilly has released an open source project on GitHub called `nbtransom` which builds atop `nbformat` and `pandas` for our active learning use cases.
This work anticipates upcoming work on collaborative documents in JupyterLab, based on Google Drive. In other words, where the machines and people are collaborators on shared documents.
Humans in the loop: AI in open source and industry, by Paco Nathan
Nike Tech Talk, Portland, 2017-08-10
https://niketechtalks-aug2017.splashthat.com/
O'Reilly Media gets to see the forefront of trends in artificial intelligence: what the leading teams are working on, which use cases are getting the most traction, previews of advances before they get announced on stage. Through conferences, publishing, and training programs, we've been assembling resources for anyone who wants to learn. An excellent recent example: Generative Adversarial Networks for Beginners, by Jon Bruner.
This talk covers current trends in AI, industry use cases, and recent highlights from the AI Conf series presented by O'Reilly and Intel, plus related materials from Safari learning platform, Strata Data, Data Show, and the upcoming JupyterCon.
Along with reporting, we're leveraging AI in Media. This talk dives into O'Reilly uses of deep learning -- combined with ontology, graph algorithms, probabilistic data structures, and even some evolutionary software -- to help editors and customers alike accomplish more of what they need to do.
In particular, we'll show two open source projects in Python from O'Reilly's AI team:
• pytextrank, built atop spaCy, NetworkX, and datasketch, providing graph algorithms for advanced NLP and text analytics
• nbtransom, leveraging Project Jupyter for a human-in-the-loop design pattern approach to AI work: people and machines collaborating on content annotation
Lessons learned from 3 (going on 4) generations of Jupyter use cases at O'Reilly Media. In particular, about "Oriole" tutorials which combine video with Jupyter notebooks, Docker containers, backed by services managed on a cluster by Marathon, Mesos, Redis, and Nginx.
https://conferences.oreilly.com/fluent/fl-ca/public/schedule/detail/62859
https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/62858
Strata UK 2017. Computable content leverages Jupyter notebooks to make learning materials more powerful by integrating compute engines, data sources, etc. O’Reilly Media extended this approach to create the new Oriole Online Tutorial medium, publishing notebooks from authors along with video timelines. (A free public tutorial, Regex Golf, by Peter Norvig demonstrates what’s possible with this technology integration.) Each user session launches a Docker container on a Mesos cluster for fully personalized compute environments. The UX is entirely browser based.
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates use of `TextBlob` and `SpaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
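To make the graph-algorithm idea concrete, here is a stripped-down TextRank-style sketch: build a word co-occurrence graph, then run PageRank-style power iteration. It omits PyTextRank's POS filtering, lemmatization, and MinHash summarization, and the window, damping, and iteration parameters are illustrative defaults.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iters=30, top=3):
    """Rank words by PageRank over a co-occurrence graph."""
    # Undirected graph: connect words co-occurring within `window` tokens
    graph = defaultdict(set)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if w != tokens[j]:
                graph[w].add(tokens[j])
                graph[tokens[j]].add(w)
    # Power iteration, following the Mihalcea & Tarau 2004 formulation
    rank = {w: 1.0 for w in graph}
    for _ in range(iters):
        rank = {w: (1 - damping) + damping * sum(
                    rank[nb] / len(graph[nb]) for nb in graph[w])
                for w in graph}
    return sorted(rank, key=rank.get, reverse=True)[:top]

tokens = "spark streaming spark cluster streaming analytics".split()
print(textrank_keywords(tokens, top=2))
```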
Use of standards and related issues in predictive analytics, by Paco Nathan
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
Presented 2015-08-24 at SF Bay ACM, held at the eBay south campus in San Jose.
http://meetup.com/SF-Bay-ACM/events/221693508/
Project Jupyter https://jupyter.org/ evolved from IPython notebooks, and now supports a wide variety of programming-language back-ends. Notebooks have proven to be effective tools in Data Science, providing convenient packaging for what Don Knuth coined "literate programming" in the 1980s: code plus exposition in markdown. Results of running the code appear in-line as interactive graphics -- all packaged as collaborative, web-based documents. Some have said that the introduction of cloud-based notebooks is nearly as fundamental a change in software practice as the introduction of spreadsheets.
O'Reilly Media has been considering the question, "What comes after books and video?" Or, as one might imagine more pointedly, what comes after Kindle? To that point we have collaborated with Project Jupyter to integrate notebooks into our content management process, allowing authors to generate articles, tutorials, reports, and other media products as notebooks that also incorporate video segments. Code dependencies are containerized using Docker, and all of the content gets managed in Git repositories. We have added another layer, an open source project called Thebe, that provides a kind of "media player" for embedding the containerized notebooks into web pages.
GalvanizeU Seattle: Eleven Almost-Truisms About Data, by Paco Nathan
http://www.meetup.com/Seattle-Data-Science/events/223445403/
Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study the technologies, the processes, and the people involved.
Microservices, containers, and machine learning, by Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. A natural language processing service in Python (based on NLTK, TextBlob, WordNet, etc.) gets containerized and used to crawl and parse email archives. These produce JSON data sets; then we run machine learning on a Spark cluster to find insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data, based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
https://www.eventbrite.com/e/talk-by-paco-nathan-graph-analytics-in-spark-tickets-17173189472
Big Brains meetup hosted by BloomReach, 2015-06-04
Case study / demo of a large-scale graph analytics project, leveraging GraphX in Apache Spark to surface insights about open source developer communities — based on data mining of their email forums. The project works with any Apache email archive, applying NLP and machine learning techniques to analyze message threads, then constructs a large graph. Graph analytics, based on concise Scala coding examples in Spark, surface themes and interactions within the community. Results are used as feedback for respective developer communities, such as leaderboards, etc. As an example, we will examine analysis of the Spark developer community itself.
QCon São Paulo: Real-Time Analytics with Spark Streaming, by Paco Nathan
"Real-Time Analytics with Spark Streaming" presented at QCon São Paulo, 2015-03-26
http://qconsp.com/presentation/real-time-analytics-spark-streaming
This talk presents an overview of Spark and its history and applications, then focuses on the Spark Streaming component used for real-time analytics. We compare it with earlier frameworks such as MillWheel and Storm, and explore industry motivations for open-source micro-batch streaming at scale.
The talk will include demos for streaming apps that include machine-learning examples. We also consider public case studies of production deployments at scale.
We’ll review the use of open-source sketch algorithms and probabilistic data structures that get leveraged in streaming – for example, the trade-off of 4% error bounds on real-time metrics for two orders of magnitude reduction in required memory footprint of a Spark app.
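The memory trade-off mentioned above is visible even in a tiny Count-Min sketch, shown here as a pure-Python illustration. This is a sketch of the technique, not a production implementation: a real streaming job would use a library with pairwise-independent hash families rather than SHA-256, and the width/depth values here are illustrative.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts in fixed memory.

    Storage is width * depth integers no matter how many distinct keys
    arrive; estimates may overcount (never undercount), and widening the
    table tightens the error bound -- the memory/accuracy trade-off.
    """
    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One hashed column per row; SHA-256 stands in for a proper
        # pairwise-independent hash family.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Collisions only inflate counts, so the minimum is the best guess
        return min(self.table[row][col] for row, col in self._cells(key))

cms = CountMinSketch()
cms.add("spark", 5)
cms.add("storm", 3)
print(cms.estimate("spark"), cms.estimate("storm"))
```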
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More, by Paco Nathan
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report was prepared by the Threat Research Team at Sectrio, using data from Sectrio's cyber threat intelligence farming facilities spread across more than 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers, without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He brings around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
1. Getting Started on Hadoop
Silicon Valley Cloud Computing Meetup
Mountain View, 2010-07-19
http://www.meetup.com/cloudcomputing/calendar/13911740/
Paco Nathan
@pacoid
http://ceteri.blogspot.com/
Examples of Hadoop Streaming, based on Python scripts
running on the AWS Elastic MapReduce service.
• first, a brief history…
• AWS Elastic MapReduce
• “WordCount” example as “Hello World” for MapReduce
• text mining Enron Email Dataset from Infochimps.com
• inverted index, semantic lexicon, social graph
• data visualization using R and Gephi
All source code for this talk is available at:
http://github.com/ceteri/ceteri-mapred
2. How Does MapReduce Work?
map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v3)
Several phases, which partition a problem into many tasks:
• load data into DFS…
• map phase: input split → (key, value) pairs, with optional combiner
• shuffle phase: sort on keys to group pairs… load-test your network!
• reduce phase: each task receives the values for one key
• pull data from DFS…
NB: the “map” phase is required; the rest are optional.
Think of set operations on tuples (and check out Cascading.org).
Meanwhile, given all those (key, value) pairs listed above, it’s no
wonder that key/value stores have become such a popular topic of
conversation…
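The phase breakdown above can be made concrete with a toy, single-process simulation in Python (a sketch for intuition only, not part of the talk): map emits (key, value) pairs, the shuffle sorts them by key, and each reduce call then sees all values for one key.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    # map phase: each input record yields zero or more (key, value) pairs
    pairs = [kv for rec in records for kv in mapper(rec)]
    # shuffle phase: sort on keys so pairs for one key become adjacent
    pairs.sort(key=itemgetter(0))
    # reduce phase: each reducer call receives all values for one key
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

# word counting as the mapper/reducer pair, matching the signatures
# map(k1, v1) -> list(k2, v2) and reduce(k2, list(v2)) -> list(v3)
def wc_map(line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return (word, sum(counts))

print(run_mapreduce(["a b a", "b a"], wc_map, wc_reduce))
# -> [('a', 3), ('b', 2)]
```

Nothing here is distributed, of course; the point is just that the shuffle's sort is what delivers “all values for one key” to a single reduce call.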
3. How Does MapReduce Work?
map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v3)
The property of data independence among tasks allows for highly
parallel processing… maybe, if the stars are all aligned :)
Primarily, a MapReduce framework is largely about fault tolerance, and
how to leverage “commodity hardware” to replace “big iron” solutions…
That phrase “big iron” might apply to Oracle + NetApp. Or perhaps an
IBM zSeries mainframe… Or something – expensive, undoubtedly.
Bonus questions for self-admitted math geeks: Foresee any concerns
about O(n) complexity, given the functional definitions listed above?
Keep in mind that each phase cannot conclude and progress to the
next phase until after each of its tasks has successfully completed.
4. A Brief History…
circa 1979 – Stanford, MIT, CMU, etc.
set/list operations in LISP, Prolog, etc., for parallel processing
http://www-formal.stanford.edu/jmc/history/lisp/lisp.htm
circa 2004 – Google
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
http://labs.google.com/papers/mapreduce.html
circa 2006 – Apache
Hadoop, originating from the Nutch Project
Doug Cutting
http://research.yahoo.com/files/cutting.pdf
circa 2008 – Yahoo
web scale search indexing
Hadoop Summit, HUG, etc.
http://developer.yahoo.com/hadoop/
circa 2009 – Amazon AWS
Elastic MapReduce
Hadoop modified for EC2/S3, plus support for Hive, Pig, etc.
http://aws.amazon.com/elasticmapreduce/
5. Why run Hadoop in AWS?
• elastic: batch jobs on clusters can consume many nodes,
scalable demand, not 24/7 – great case for using EC2
• commodity hardware: MR is built for fault tolerance, great
case for leveraging AMIs
• right-sizing: difficult to know a priori how large of a cluster
is needed – without running significant jobs (test k/v skew,
data quality, etc.)
• when your input data is already in S3, SDB, EBS, RDS…
• when your output needs to be consumed in AWS …
You really don't want to buy rack space in a datacenter before
assessing these issues – besides, a private datacenter probably
won’t even be cost-effective afterward.
6. But why run Hadoop on Elastic MapReduce?
• virtualization: Hadoop needs some mods to run well in
that kind of environment
• pay-per-drink: absorbs cost of launching nodes
• secret sauce: Cluster Compute Instances (CCI) and
Spot Instances (SI)
• DevOps: EMR job flow mgmt optimizes where your staff
spends their (limited) time+capital
• logging to S3, works wonders for troubleshooting
7. A Tale of Two Ventures…
Adknowledge: in 2008, our team became one of the larger
use cases running Hadoop on AWS
• prior to the launch of EMR
• launching clusters of up to 100 m1.xlarge
• initially 12 hrs/day, optimized down to 4 hrs/day
• displaced $3MM capex for Netezza
ShareThis: in 2009, our team used even more Hadoop
on AWS than that previous team
• this time with EMR
• larger/more frequent jobs
• lower batch failure rate
• faster turnaround on results
• excellent support
• smaller team required
• much less budget
8. “WordCount”, a “Hello World” for MapReduce
Definition: count how often each word appears within a collection
of text documents.
A simple program that makes a good test case for what MapReduce
can perform, since it incorporates:
• minimal amount of code
• document feature extraction (where words are “terms”)
• symbolic and numeric values
• potential use of a combiner
• bipartite graph of (doc, term) tuples
• not so many steps away from useful indexing…
When a framework can run “WordCount” in parallel at scale, then it
can handle much larger, more interesting compute problems as well.
9. Bipartite Graph
Wikipedia: “…a bipartite graph is a graph whose vertices can be divided
into two disjoint sets U and V such that every edge connects a vertex in U
to one in V… ”
http://en.wikipedia.org/wiki/Bipartite_graph
Consider the case where:
U ≡ { documents }
V ≡ { terms }
Many kinds of text analytics products
can be constructed based on this
data structure as a foundation.
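As an illustration (not from the slides, with made-up document and term names), the two disjoint vertex sets and one product built on the edges, an inverted index, can be sketched in a few lines of Python:

```python
# edges of the bipartite graph: (doc, term) membership tuples
doc_term = [
    ("doc1", "enron"), ("doc1", "energy"),
    ("doc2", "energy"), ("doc2", "trading"),
]

U = {doc for doc, _ in doc_term}    # vertex set U = { documents }
V = {term for _, term in doc_term}  # vertex set V = { terms }
assert U.isdisjoint(V)              # bipartite: the two sets share no vertex

# an inverted index falls out directly: term -> set of containing docs
inverted = {}
for doc, term in doc_term:
    inverted.setdefault(term, set()).add(doc)

print(sorted(inverted["energy"]))
# -> ['doc1', 'doc2']
```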
10. “WordCount”, in other words…
map(doc_id, text)
→ list(word, count)
reduce(word, list(count))
→ list(sum_count)
11. “WordCount”, in other words…
void map (String doc_id, String text):
for each word w in segment(text):
emitPartial(w, "1");
void reduce (String word, Iterator partial_counts):
int count = 0;
for each pc in partial_counts:
count += Int(pc);
emitResult(String(count));
12. Hadoop Streaming
One way to approach MapReduce jobs in Hadoop is to use streaming.
In other words, use any kind of script which can be run from a command
line and read/write data via stdin and stdout:
http://hadoop.apache.org/common/docs/current/streaming.html#Hadoop+Streaming
The following examples use Python scripts for Hadoop Streaming. One
really great benefit is that then you can dev/test/debug your MapReduce
code on small data sets from a command line simply by using pipes:
cat input.txt | mapper.py | sort | reducer.py
BTW, there are much better ways to handle Hadoop Streaming in Python
on Elastic MapReduce – for example, using the “boto” library. However,
these examples are kept simple so they’ll fit into a tech talk!
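The real map_wc.py and red_wc.py are in the GitHub repo linked at the end; below is a minimal sketch of what such a streaming pair might look like (an approximation for illustration, not the talk's exact code). Each function maps an iterable of input lines to output lines, so either can be wrapped in a small script that reads sys.stdin and prints, exactly as the pipe above assumes.

```python
def map_wc(lines):
    # mapper stage: one line of text in, one "word<TAB>1" pair out per word
    for line in lines:
        for word in line.strip().lower().split():
            yield "%s\t1" % word

def red_wc(lines):
    # reducer stage: relies on the shuffle (or `sort` in the local pipe)
    # having placed all pairs for one word on consecutive lines
    current, total = None, 0
    for line in lines:
        word, count = line.strip().split("\t")
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)

# local equivalent of: cat input.txt | mapper.py | sort | reducer.py
print(list(red_wc(sorted(map_wc(["a b a", "b"])))))
# -> ['a\t2', 'b\t2']
```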
15. “WordCount”, in other words…
# this Linux command line...
cat foo.txt | map_wc.py | sort | red_wc.py
# produces output like this...
tuple	9
term	6
tfidf	6
sort	5
analysis	2
wordcount	1
user	1
# depending on input -
# which could be HTML content, tweets, email, etc.
16. Speaking of Email…
Enron pioneered innovative corporate accounting methods and energy market
manipulations, involving a baffling array of fraud techniques. The firm soared to
a valuation of over $60B (growing 56% in 1999, 87% in 2000) while inducing a
state of emergency in California – which cost the state over $40B. Subsequent
prosecution of top execs plus the meteoric decline in the firm’s 2001 share value
made for a spectacular #EPIC #FAIL
http://en.wikipedia.org/wiki/Enron_scandal
http://en.wikipedia.org/wiki/California_electricity_crisis
Thanks to CALO and Infochimps, we have a half million email messages
collected from Enron managers during their, um, “heyday” period:
http://infochimps.org/datasets/enron-email-dataset--2
http://www.cs.cmu.edu/~enron/
Let’s use Hadoop to help find out: what were some of
the things those managers were talking about?
17. Simple Text Analytics
Extending from how “WordCount” works, we’ll add multiple kinds of output
tuples, plus two stages of mappers and reducers, to generate different kinds
of text analytics products:
• inverted index
• co-occurrence analysis
• TF-IDF filter
• social graph
While doing that, we'll also perform other statistical
analysis and data visualization using R and Gephi
18. Mapper 1: RFC822 Parser
map_parse.py takes a list of URIs for where to read email messages, parses
each message, then emits multiple kinds of output tuples:
(doc_id, msg_uri, date)
(sender, receiver, doc_id)
(term, term_freq, doc_id)
(term, co_term, doc_id)
Note that our dataset includes approximately 500,000 email messages, with an
average of about 100 words in each message.
Also, there are roughly 10^5 unique terms. That count tends toward a constant
for English-language texts, which is great to know when configuring capacity.
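As a sketch only (the function and field names below are invented for illustration, not taken from map_parse.py), one common streaming pattern for a mapper that emits several kinds of output tuples is to tag each tuple with its kind, so downstream stages can route on the tag:

```python
def parse_message(doc_id, msg_uri, date, sender, receiver, terms):
    # hypothetical mapper core: one parsed message in,
    # several kinds of tagged tuples out
    yield ("doc", doc_id, msg_uri, date)
    yield ("edge", sender, receiver, doc_id)
    freq = {}
    for term in terms:
        freq[term] = freq.get(term, 0) + 1
    for term, tf in freq.items():
        yield ("term", term, tf, doc_id)

out = list(parse_message("d1", "s3://bucket/d1", "2001-05-01",
                         "sender@enron.com", "receiver@enron.com",
                         ["energy", "trading", "energy"]))
# out holds one "doc" tuple, one "edge" tuple,
# and a "term" tuple per distinct term with its frequency
```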
19. Reducer 1: TF-IDF and Co-Occurrence
red_idf.py takes the shuffled output from map_parse.py, collects metadata
for each term, calculates TF-IDF to use in a later stage for filtering, calculates
co-occurrence probability, then emits all these results:
(doc_id, msg_uri, date)
(sender, receiver, doc_id)
(term, idf, count)
(term, co_term, prob_cooc)
(term, tfidf, doc_id)
(term, max_tfidf)
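The slides don't spell out the TF-IDF formula; one common definition (an assumption here, not necessarily what red_idf.py implements) is tf-idf = tf × log(N / df), which zeroes out terms that appear in every document:

```python
from math import log

def tf_idf(term_freq, doc_freq, n_docs):
    # idf damps terms that occur in many documents;
    # a term present in all n_docs documents scores exactly 0
    return term_freq * log(float(n_docs) / doc_freq)

assert tf_idf(3, 10, 10) == 0.0             # ubiquitous term: filtered out
assert tf_idf(3, 1, 10) > tf_idf(3, 9, 10)  # rarer terms score higher
```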
23. Mapper 2 + Reducer 2: Threshold Filter
map_filter.py and red_filter.py apply a threshold (based on statistical
analysis of TF-IDF) to filter results of co-occurrence analysis so that we begin to
produce a semantic lexicon for exploring the data set.
How do we determine a reasonable value for the TF-IDF threshold, for filtering
terms? Sampling from the (term, max_tfidf) tuple, we run summary stats and
visualization in R:
cat dat.idf | util_extract.py m > thresh.tsv
We also convert the sender/receiver social graph into CSV format for Gephi
visualization:
cat dat.parsed | util_extract.py s | util_gephi.py | sort -u > graph.csv
38. Best Practices
• Again, there are much more efficient ways to handle Hadoop Streaming
and Text Analytics…
• Unit Tests, Continuous Integration, etc., – all great stuff, but “Big Data”
software engineering requires additional steps
• Sample data, measure data ratios and cluster behaviors, analyze in R,
visualize everything you can, calibrate any necessary “magic numbers”
• Develop and test code on a personal computer in an IDE, at the command line, etc.,
using minimal data sets
• Deploy to staging cluster with larger data sets for integration tests and QA
• Run in production with A/B testing where feasible to evaluate changes
quantitatively
• Learn from others at meetups, unconfs, forums, etc.
39. Great Resources for Diving into Hadoop
Google: Cluster Computing and MapReduce Lectures
http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html
Amazon AWS Elastic MapReduce
http://aws.amazon.com/elasticmapreduce/
Hadoop: The Definitive Guide, by Tom White
http://oreilly.com/catalog/9780596521981
Apache Hadoop
http://hadoop.apache.org/
Python “boto” interface to EMR
http://boto.cloudhackers.com/emr_tut.html
40. Excellent Products for Hadoop in Production
Datameer
http://www.datameer.com/
“Democratizing Big Data”
Designed for business users, Datameer Analytics Solution (DAS) builds
on the power and scalability of Apache Hadoop to deliver an easy-to-use
and cost-effective solution for big data analytics. The solution integrates
rapidly with existing and new data sources to deliver sophisticated
analytics.
Cascading
http://www.cascading.org/
Cascading is a feature-rich API for defining and executing complex,
scale-free, and fault tolerant data processing workflows on a Hadoop
cluster, which provides a thin Java library that sits on top of Hadoop's
MapReduce layer. Open source in Java.
41. Scale Unlimited – Hadoop Boot Camp
Santa Clara, 22-23 July 2010
http://www.scaleunlimited.com/courses/hadoop-bootcamp-santaclara
• An extensive overview of the Hadoop architecture
• Theory and practice of solving large scale data processing problems
• Hands-on labs covering Hadoop installation, development, debugging
• Common and advanced “Big Data” tasks and solutions
Special $500 discount for SVCC Meetup members:
http://hadoopbootcamp.eventbrite.com/?discount=DBDatameer
Sample material – list of questions about intro Hadoop from
the recent BigDataCamp:
http://www.scaleunlimited.com/blog/intro-to-hadoop-at-bigdatacamp
42. Getting Started on Hadoop
Silicon Valley Cloud Computing Meetup
Mountain View, 2010-07-19
http://www.meetup.com/cloudcomputing/calendar/13911740/
Paco Nathan
@pacoid
http://ceteri.blogspot.com/
Examples of Hadoop Streaming, based on Python scripts
running on the AWS Elastic MapReduce service.
All source code for this talk is available at:
http://github.com/ceteri/ceteri-mapred