Mahout is a machine learning library that provides implementations of common machine learning algorithms. It started as a subproject of Apache Lucene in 2008 and became a top-level Apache project in 2010. Mahout algorithms can run locally or on Hadoop for distributed processing of large datasets. The key areas Mahout supports are recommender systems, clustering (with algorithms such as k-means), and classification (with algorithms such as naive Bayes).
Some slides about the Map/Reduce programming model (for academic purposes), adapting some examples from the book MapReduce Design Patterns.
Special thanks to the following authors and sources:
- http://shop.oreilly.com/product/0636920025122.do
- http://mapreducepatterns.com/index.php?title=Main_Page
- http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
This deck was presented at the Spark meetup in Bangalore. The key idea behind the presentation was to focus on the limitations of Hadoop MapReduce and introduce both Hadoop YARN and Spark in this context. An overview of the other aspects of the Berkeley Data Analytics Stack was also provided.
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the second half of the tutorial.
Implementation of the p-PIC algorithm in MapReduce to handle big data (eSAT Publishing House)
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
This slide deck is used as an introduction to the MapReduce programming model, trying hard to be Hadoop-agnostic, as part of the Distributed Systems and Cloud Computing course I teach at Eurecom.
Course website:
http://michiard.github.io/DISC-CLOUD-COURSE/
Sources available here:
https://github.com/michiard/DISC-CLOUD-COURSE
This slide deck is used as an introduction to Relational Algebra and its relation to the MapReduce programming model, as part of the Distributed Systems and Cloud Computing course I teach at Eurecom.
Course website:
http://michiard.github.io/DISC-CLOUD-COURSE/
Sources available here:
https://github.com/michiard/DISC-CLOUD-COURSE
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015 (Deanna Kosaraju)
Optimal Execution Of MapReduce Jobs In Cloud
Anshul Aggarwal, Software Engineer, Cisco Systems
Session Length: 1 Hour
Tue March 10 21:30 PST
Wed March 11 0:30 EST
Wed March 11 4:30 UTC
Wed March 11 10:00 IST
Wed March 11 15:30 Sydney
Voices 2015 www.globaltechwomen.com
We use the MapReduce programming paradigm because it lends itself well to most data-intensive analytics jobs run in the cloud these days, given its ability to scale out and leverage several machines to process data in parallel. Research has demonstrated that existing approaches to provisioning other applications in the cloud are not immediately relevant to MapReduce-based applications. Provisioning a MapReduce job entails requesting the optimum number of resource sets (RS) and configuring MapReduce parameters such that each resource set is maximally utilized.
Each application has a different bottleneck resource (CPU, disk, or network) and a different bottleneck resource utilization, and thus needs a different combination of these parameters, based on the job profile, such that the bottleneck resource is maximally utilized.
The problem at hand is thus defining a resource provisioning framework for MapReduce jobs running in a cloud, keeping in mind performance goals such as optimal resource utilization with minimum incurred cost, lower execution time, energy awareness, automatic handling of node failures, and a highly scalable solution.
Beyond Map/Reduce: Getting Creative With Parallel Processing (Ed Kohlwey)
While Map/Reduce is an excellent environment for some parallel computing tasks, there are many ways to use a cluster beyond Map/Reduce. Within the last year, YARN and NextGen Map/Reduce have been contributed to the Hadoop trunk, Mesos has been released as an open source project, and a variety of new parallel programming environments have emerged, such as Spark, Giraph, Golden Orb, Accumulo, and others.
We will discuss the features of YARN and Mesos, and talk about obvious yet relatively unexplored uses of these cluster schedulers as simple work queues. Examples will be provided in the context of machine learning. Next, we will provide an overview of the Bulk-Synchronous-Parallel model of computation, and compare and contrast the implementations that have emerged over the last year. We will also discuss two other alternative environments: Spark, an in-memory version of Map/Reduce which features a Scala-based interpreter; and Accumulo, a BigTable-style database that implements a novel model for parallel computation and was recently released by the NSA.
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX (rhatr)
Graph relationships are everywhere. In fact, more often than not, analyzing relationships between points in your datasets lets you extract more business value from your data.
Consider social graphs, or relationships of customers to each other and products they purchase, as two of the most common examples. Now, if you think you have a scalability issue just analyzing points in your datasets, imagine what would happen if you wanted to start analyzing the arbitrary relationships between those data points: the amount of potential processing will increase dramatically, and the kind of algorithms you would typically want to run would change as well.
While a batch-oriented Hadoop MapReduce approach may work reasonably well elsewhere, for scalable graph processing you have to embrace an in-memory, explorative, and iterative approach. One of the best ways to tame this complexity is known as the Bulk Synchronous Parallel approach. Its two most widely used implementations are available as Hadoop ecosystem projects: Apache Giraph (used at Facebook) and GraphX (part of the Apache Spark project).
In this talk we will focus on practical advice on how to get up and running with Apache Giraph and GraphX; start analyzing simple datasets with built-in algorithms; and finally how to implement your own graph processing applications using the APIs provided by the projects. We will finally compare and contrast the two, and try to lay out some principles of when to use one vs. the other.
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin... (DB Tsai)
Apache Spark is a new cluster computing engine offering a number of advantages over its predecessor, MapReduce. Apache Spark uses an in-memory cache to scale and parallelize iterative algorithms, which makes it ideal for large-scale machine learning. It is one of the most active open source projects in big data, surpassing even Hadoop MapReduce. In this talk, DB will introduce Spark and show how to use Spark’s high-level API in Java, Scala or Python. Then, he will show how to use MLlib, a library of machine learning algorithms for big data included in Spark, to do classification, regression, clustering, and recommendation at large scale.
This gives a characterization of the machine learning computations and brings out the deficiencies of Hadoop 1.0. It gives the motivation for Hadoop YARN and a brief view of YARN architecture. It illustrates the power of specialized processing frameworks over YARN, such as Spark and GraphLab. In short, Hadoop YARN allows your data to be stored in HDFS and specialized processing frameworks may be used to process the data in various ways.
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial (Xavier Amatriain)
There is more to recommendation algorithms than rating prediction. And, there is more to recommender systems than algorithms. In this tutorial, given at the 2012 ACM Recommender Systems Conference in Dublin, I review things such as different interaction and user feedback mechanisms, offline experimentation and AB testing, or software architectures for Recommender Systems.
Linked Open Data to Support Content-based Recommender Systems - I-SEMANTIC… (Roku)
Linked Open Data to Support Content-based Recommender Systems
Tommaso Di Noia, Roberto Mirizzi, Vito Claudio Ostuni, Davide Romito, Markus Zanker
I-SEMANTICS 2012, 8th Int. Conf. on Semantic Systems, Sept. 5-7, 2012, Graz, Austria.
The World Wide Web is moving from a Web of hyper-linked documents to a Web of linked data. Thanks to the spread of the Semantic Web and to the more recent Linked Open Data (LOD) initiative, a vast amount of RDF data has been published in freely accessible datasets. These datasets are connected with each other to form the so-called Linked Open Data cloud. As of today, there is a wealth of RDF data available in the Web of Data, but only few applications really exploit its potential power. In this paper we show how these data can successfully be used to develop a recommender system (RS) that relies exclusively on the information encoded in the Web of Data. We implemented a content-based RS that leverages the data available within Linked Open Data datasets (in particular DBpedia, Freebase and LinkedMDB) in order to recommend movies to the end users. We extensively evaluated the approach and validated the effectiveness of the algorithms by experimentally measuring their accuracy with precision and recall metrics.
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur... (Alessandro Suglia)
Presentation for "Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neural Networks" at the 7th Italian Information Retrieval Workshop.
See paper: http://ceur-ws.org/Vol-1653/paper_11.pdf
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur... (Claudio Greco)
Slides for the presentation of the paper "Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neural Networks" at the 7th Italian Information Retrieval Workshop.
R and Hadoop are changing the way organizations manage and utilize big data. Think Big Analytics and Revolution Analytics are helping clients plan, build, test and implement innovative solutions based on the two technologies that allow clients to analyze data in new ways; exposing new insights for the business. Join us as Jeffrey Breen explains the core technology concepts and illustrates how to utilize R and Revolution Analytics’ RevoR in Hadoop environments.
2. Mahout: Brief History
• Started in 2008 as a subproject of Apache Lucene:
  – Text mining, clustering, some classification.
• Sean Owen started Taste in 2005:
  – A recommender engine for business that never took off;
  – The Mahout community asked to merge in the Taste code.
• Became a top-level Apache project in April 2010.
• This lineage resulted in a fragmented framework.
NICTA Copyright 2011 From imagination to impact
3. Mahout: What is it?
• A collection of machine learning algorithm implementations:
  – Many (not all) implemented on Hadoop MapReduce;
  – A Java library with a handy command line interface to run common tasks.
• Currently serves 3 key areas:
  – Recommendation engines
  – Clustering
  – Classification
• The focus of today’s talk is the functionality accessible from the command line interface:
  – Most accessible for Hadoop beginners.
4. Recommenders
• Supports user-based and item-based collaborative filtering:
  – User-based: similarity between users;
  – Item-based: similarity between items a user likes and other items.
[Diagram: users 1-6 and items A-F; e.g. user 3 likes item E, so user 1 may like a similar item.]
5. Implementations
• Non-distributed (no Hadoop requirement):
  – The ‘Taste’ code; supports item- and user-based recommenders;
  – Good for up to 100 million user-item associations;
  – Faster than the distributed version.
• Distributed (Hadoop MapReduce):
  – Item-based, using a (configurable) similarity measure between items;
  – Latent factor based:
    • Estimates ‘genres’ of items from user preferences;
    • Similar to the entry that won the Netflix prize.
  – Both have command line interfaces.
6. Distributed Item Recommender
[Diagram: a user-item ratings matrix (users 1..m in rows, items 1..n in columns, with an entry R wherever a rating exists) is fed through a similarity calculation to produce an n x n item similarity matrix with 1s on the diagonal (e.g. sim(1,2) = .2, sim(1,n) = .8, sim(2,n) = .5).]
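The similarity calculation in the diagram above can be sketched in a few lines of single-machine Python. This is purely illustrative (Mahout's distributed implementation is Java/MapReduce, and cosine is just one of its configurable measures); the ratings below are made-up numbers, with 0 meaning "no rating":

```python
import math

def cosine(a, b):
    """Cosine similarity between two item rating vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

ratings = [  # rows: users 1..m, columns: items 1..n (invented values)
    [5, 3, 0],
    [4, 0, 4],
    [1, 1, 5],
]
n_items = len(ratings[0])
cols = [[row[j] for row in ratings] for j in range(n_items)]
# n x n item similarity matrix, 1s on the diagonal, symmetric
similarity = [[cosine(cols[i], cols[j]) for j in range(n_items)]
              for i in range(n_items)]
```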
7. Distributed Item Recommendation
Input csv file: user, item, rating, …
Output csv file: item, item, similarity, …

mahout itemsimilarity -i <input_file> -o <output_path> …

[Diagram: user 3's unknown rating for item 3 is predicted from the ratings R2 and R5 of similar items, weighted by the similarities s2,3 and s3,5:
R3 = (s2,3 * R2 + s3,5 * R5) / (s2,3 + s3,5)]
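The prediction step shown on this slide is a similarity-weighted average of the user's known ratings. A minimal illustrative sketch in Python (not Mahout code; the item ids, ratings, and similarities below are invented values):

```python
def predict(user_ratings, sims):
    """Predict a rating as the similarity-weighted average of rated items.

    user_ratings: {item: rating} the user gave;
    sims: {item: similarity to the target item}.
    Returns 0.0 when there are no similar rated items.
    """
    num = sum(sims[i] * r for i, r in user_ratings.items())
    den = sum(sims[i] for i in user_ratings)
    return num / den if den else 0.0

# As in the slide: R3 = (s2,3 * R2 + s3,5 * R5) / (s2,3 + s3,5)
r3 = predict({2: 4.0, 5: 2.0}, {2: 0.8, 5: 0.4})  # = (3.2 + 0.8) / 1.2
```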
8. Distributed Item Recommendation
• Can perform item similarity and recommendation generation in a single call.

Input csv file (tab separated): user, item, rating, …
Users file (tab separated): one user per line.

mahout recommenditembased -i <input_path> -o <output_path> -u <users_file> --numRecommendations …

--numRecommendations: number of recommendations to return per user.
Output csv file (tab separated): user, item,score, item,score, …
9. Clustering
[Workflow: Vectorization → Clustering]
10. Clustering
• Don’t know the structure of data, want to sensibly group
things together.
• A number of distributed algorithms supported:
– Canopy Clustering (MAHOUT-3 – integrated)
– K-Means Clustering (MAHOUT-5 – integrated)
– Fuzzy K-Means (MAHOUT-74 – integrated)
– Expectation Maximization (EM) (MAHOUT-28)
– Mean Shift Clustering (MAHOUT-15 – integrated)
– Hierarchical Clustering (MAHOUT-19)
– Dirichlet Process Clustering (MAHOUT-30 – integrated)
– Latent Dirichlet Allocation (MAHOUT-123 – integrated)
– Spectral Clustering (MAHOUT-363 – integrated)
– Minhash Clustering (MAHOUT-344 - integrated)
• Some have command line interface support.
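As a minimal illustration of what these algorithms do (Mahout's versions run distributed over Hadoop; this is a simplified single-machine sketch), a plain k-means loop alternates between assigning points to their nearest centroid and recomputing each centroid as the mean of its cluster:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each point to the nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return centroids, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, clusters = kmeans(pts, k=2)
```

The distributed variant parallelizes the assignment step across mappers and the centroid recomputation in reducers.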
11. Vectorization
• Data specific.
– Majority of cases need to write a map-reduce job to generate
vectorized input
– Input formats are still not uniform across Mahout.
– Most clustering implementations expect:
• SequenceFile(WritableComparable, VectorWritable)
• Note: key is ignored.
• Mahout has some support for clustering text documents:
– Can generate n-gram Term Frequency-Inverse Document
Frequency (TF-IDF) from a directory of text documents;
– Enables text documents to be clustered using command line
interface.
12. Text Document Vectorization

Term frequency (one document):
The  | conduct | as  | run | doctor | with | a   | Patel
 47  |    3    |  7  |  5  |   8    |  12  | 54  |   6

Document frequency (whole corpus):
The  | conduct | as  | run | doctor | with | a   | Patel
1000 |   198   | 999 | 567 |   48   | 998  | 100 |   3

TF-IDF_i = TF_i × log(N / DF_i)

Increases the weight of less common words/n-grams within the corpus.

Unigram: (crude)
Bi-gram: (crude, oil)
Tri-gram: (crude, oil, prices)
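The TF-IDF formula can be sketched directly; `tfidf` is an illustrative helper with made-up counts, not Mahout's implementation.

```python
import math

def tfidf(doc_term_counts):
    """Compute TF-IDF_i = TF_i * log(N / DF_i) for every term of every
    document. `doc_term_counts` is a list of {term: count} dicts."""
    n_docs = len(doc_term_counts)
    df = {}
    for counts in doc_term_counts:
        for term in counts:
            df[term] = df.get(term, 0) + 1
    return [{term: tf * math.log(n_docs / df[term])
             for term, tf in counts.items()}
            for counts in doc_term_counts]

docs = [{"the": 47, "patel": 6}, {"the": 40, "doctor": 8}, {"the": 33}]
weights = tfidf(docs)
```

A term like "the" that appears in every document gets weight log(N/N) = 0, while a rare term like "patel" keeps a high weight.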
13. Text Document Vectorization
Directory of plain text documents → sequence file: <name, text body>

mahout seqdirectory -i <input_path> -o <seq_output_path>

Sequence file → {term frequency, dictionary file, inverse document frequency}
→ n-gram generation → TF-IDF vector generation

mahout seq2sparse -i <seq_input_path> -o <output_path>
15. Classification
• Train the machine to provide discrete answers to a
specific question.
• Mahout supports the following algorithms:
– Logistic Regression
– Naïve Bayes
– Random Forests
– Others in development
[Diagram: data with known answers (e.g. 100100: A, 010011: A, 010110: B, 100101: A,
010101: B) feeds a model training algorithm; the trained model then assigns estimated
answers to data without answers (e.g. 101100: ? → 101100: A).]
16. Classification Workflow
1. Vectorize the labelled input set; sample ~90% as a training set
   and ~10% as a test set.
2. Model training on the training set.
3. Model testing on the held-out test set.
[Diagram: new input set → Vectorize → Trained Model → labels (A, B, A, …);
the output labels are an approximation.]
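A minimal sketch of the 90/10 split-and-test workflow above, assuming a hypothetical `model` callable that maps a feature vector to a label (not Mahout's API):

```python
import random

def split(examples, train_fraction=0.9, seed=42):
    """Shuffle labelled examples and split ~90% for training, ~10% for testing."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, test_set):
    """Fraction of held-out examples the trained model labels correctly."""
    hits = sum(1 for features, label in test_set if model(features) == label)
    return hits / len(test_set)

# Toy data: feature (1,) always means label "A", (0,) means "B".
data = [((i % 2,), "A" if i % 2 else "B") for i in range(100)]
train, test = split(data)
model = lambda features: "A" if features[0] == 1 else "B"  # stand-in trained model
acc = accuracy(model, test)
```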
17. Feature Extraction
• Good feature extraction is critical to
trained model performance:
– Need domain understanding to ‘measure’ the
right things.
– If you measure the wrong things, even the best model
will perform badly.
– Caution needed to avoid ‘label leaks’.
• Will typically require hand written map-
reduce code:
– If text based, can use text mining tools in
HIVE or Mahout.
18. Naïve Bayes Classifier

classify(f_1, f_2, …, f_n) = argmax_l p(L = l) ∏_{i=1}^{n} p(F_i = f_i | L = l)

(f_1, …, f_n: feature vector; l: label)

Probability of feature i having value f_i given l, e.g. assuming a Gaussian pdf:

P(f = v | l) = (1 / √(2π σ_l²)) · exp(−(v − μ_l)² / (2σ_l²))

Note: Model training boils down to estimating the conditional mean and variance of the
feature vector elements. This can be trivially parallelized and implemented in
map reduce.
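A single-machine sketch of training and classification under the Gaussian assumption above (illustrative only; Mahout's distributed implementation differs, and the helper names are hypothetical):

```python
import math
from collections import defaultdict

def train(examples):
    """Estimate per-label priors and per-feature Gaussian (mean, variance)."""
    by_label = defaultdict(list)
    for features, label in examples:
        by_label[label].append(features)
    model = {}
    for label, rows in by_label.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [max(sum((x - m) ** 2 for x in col) / n, 1e-9)
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(examples), means, variances)
    return model

def classify(model, features):
    """argmax over labels of log p(l) + sum_i log N(f_i; mu_l, sigma_l^2)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for f, m, v in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (f - m) ** 2 / (2 * v)
        return score
    return max(model, key=log_posterior)

examples = [((1.0, 5.0), "A"), ((1.2, 4.8), "A"),
            ((6.0, 1.0), "B"), ((5.8, 1.3), "B")]
model = train(examples)
```

Because each label's statistics are simple sums over its examples, the training step maps naturally onto MapReduce: mappers emit per-label partial sums, reducers combine them.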
19. Naïve Bayes in Mahout
Command line specific to text classification (e.g. spam detection, document
classification, etc.)

Input: plain text file, format: label \t word word word …
Output: generated model, a set of files in sequence file format (variances).

mahout trainclassifier -i <input_path> -o <output_path>
  --gramSize <n_gram_size> -minDf <minimum_DF> -minSupport <min_TF> …

– --gramSize: n-gram size, default = 1;
– -minDf: discard n-grams that occur in less than this number of documents;
– -minSupport: discard n-grams that occur less than this number of times in a document.
20. Naïve Bayes in Mahout
• Need to write your own classifier to be
practical.
[Diagram: document → Classifier (uses Trained Model) → label]
Look at class:
org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm
• Classify document;
• Return top n predicted labels;
• Return classification certainty;
• …
21. Classification vs. Recommendation
• Can use a classifier to recommend:
– Interested in item or not interested?
• Classifier is based on features of the
specific item and the customer
• Recommendation based on past behavior
of customers
• Classification: single decisions
• Recommendation: ranking