Hadoop is used at Salesforce for several big data use cases including product metrics, user behavior analysis, capacity planning, and collaborative filtering. For product metrics, Hadoop collects and analyzes log data from over 130,000 customers to track feature usage, standard metrics, and metrics across channels. It generates reports and dashboards to provide insights to executives and product managers.
Hadoop is the technology of choice for processing large data sets. Force.com provides a great metadata layer to define Hadoop jobs and store job output (Custom Objects). Force.com also comes with a great visualization layer (Reports & Dashboards) to chart and trend the output from Hadoop jobs. In this session, we will explore a real-life use case that combines these technologies to provide a compelling big data processing framework.
Hadoop is the technology of choice for processing large data sets. At salesforce.com, we service internal and product big data use cases using a combination of Hadoop, Java MapReduce, Pig, Force.com, and machine learning algorithms.
In this webinar, you will learn about an internal use case and a product use case:
:: Product Metrics: Internally, we measure feature usage using a combination of Hadoop, Pig, and the Force.com platform (Custom Objects and Analytics).
:: Community-Based Recommendations: In Chatter, our most successful people and file recommendations are built on a collaborative filtering algorithm that is implemented on Hadoop using Java MapReduce.
Hadoop Summit, San Diego, February 2013
1. Hadoop Use Cases at Salesforce.com
Narayan Bharadwaj
Director, Product Management, Monitoring & Big Data, Salesforce.com
@nadubharadwaj
2. Safe harbor
Safe harbor statement under the Private Securities Litigation Reform Act of 1995: This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services.
The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of intellectual property and other litigation, risks associated with possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our quarterly report on Form 10-Q for the most recent fiscal quarter ended July 31, 2012. These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site.
Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.
3. Agenda
• Technology
• Big Data use cases
• Use case discussion
• Q&A
4. Got "Cloud Data"?
130k customers
1 billion transactions/day
Millions of users
Terabytes/day
7. Phoenix – "We put the SQL back in NoSQL"
• SQL layer on HBase
• Seamless application integration
  – Standard JDBC interface
  – DDL statement support
• Low query latency
  – SQL query → multiple HBase scans
  – Co-processors, custom filters
  – Milliseconds for small queries
  – Seconds for tens of millions of rows
• https://github.com/forcedotcom/phoenix
8. Contributions
@pRaShAnT1784: Prashant Kommireddi
Lars Hofhansl
@thefutureian: Ian Varley
10. Big Data Use Cases
(Diagram spanning internal apps and product features:) User behavior analysis, Product Metrics, Capacity planning, Monitoring, Query Runtime Prediction, Collections intelligence, Early Warning System, Collaborative Filtering, Search Relevancy.
12. Product Metrics – Problem Statement
• Track feature usage/adoption across 130k+ customers
  – E.g.: Accounts, Contacts, Visualforce, Apex, …
• Track standard metrics across all features
  – E.g.: #Requests, #UniqueOrgs, #UniqueUsers, AvgResponseTime, …
• Track features and metrics across all channels
  – API, UI, Mobile
• Primary audience: Executives, Product Managers
13. Product Metrics Pipeline
(Architecture diagram: log files land in Hadoop; on a client machine, a Java program drives a log pull, a Pig script generator, and a workflow; results flow through the API into Feature Metrics and Trend Metrics Custom Objects, enriched with Workflow and Formula Fields; consumed via Reports & Dashboards, Collaboration (Chatter), and User Input (Page Layout).)
18. Problem Statement
§ How do we reduce the number of clicks on the user interface?
§ What are the top user click-path sequences?
§ What are the user clusters/personas?
• Approach:
  • Markov transitions for click paths, D3.js visuals
  • K-means (unsupervised) clustering for user groups
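The Markov-transition side of the approach above can be sketched in a few lines. This is an illustrative Python sketch only — the page names and sessions are made up for the example, and the deck's actual pipeline runs on Hadoop:

```python
from collections import defaultdict

def transition_matrix(sessions):
    """Estimate first-order Markov transition probabilities
    from a list of click-path sessions (lists of page names)."""
    counts = defaultdict(lambda: defaultdict(int))
    for path in sessions:
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        probs[src] = {dst: n / total for dst, n in dsts.items()}
    return probs

# Hypothetical sessions over hypothetical page names.
sessions = [
    ["Home", "Accounts", "Contacts"],
    ["Home", "Accounts", "Reports"],
    ["Home", "Reports"],
]
probs = transition_matrix(sessions)
print(probs["Home"])  # transition probabilities out of "Home"
```

The resulting per-source probability rows are exactly what a D3.js visual (or a k-means feature vector) would consume.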
25. We found this relationship using item-to-item collaborative filtering
• Amazon published this algorithm in 2003.
  – Amazon.com Recommendations: Item-to-Item Collaborative Filtering, by Gregory Linden, Brent Smith, and Jeremy York. IEEE Internet Computing, January–February 2003.
• At Salesforce, we adapted this algorithm for Hadoop, and we use it to recommend files to view and users to follow.
26. Example: CF on 5 files
Vision Statement, Annual Report, Dilbert Comic, Darth Vader Cartoon, Disk Usage Report
27. View History Table

                 Annual   Vision     Dilbert   Darth Vader   Disk Usage
                 Report   Statement  Cartoon   Cartoon       Report
Miranda (CEO)    1        1          1         0             0
Bob (CFO)        1        1          1         0             0
Susan (Sales)    0        1          1         1             0
Chun (Sales)     0        0          1         1             0
Alice (IT)       0        0          1         1             1
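From this view history, file popularities and pairwise co-view tallies fall out directly. A minimal Python sketch (not the production Java MapReduce) that reproduces the numbers used on the following slides:

```python
from collections import Counter
from itertools import combinations

# View history from the slide: user -> set of files viewed.
views = {
    "Miranda (CEO)": {"Annual Report", "Vision Statement", "Dilbert Cartoon"},
    "Bob (CFO)":     {"Annual Report", "Vision Statement", "Dilbert Cartoon"},
    "Susan (Sales)": {"Vision Statement", "Dilbert Cartoon", "Darth Vader Cartoon"},
    "Chun (Sales)":  {"Dilbert Cartoon", "Darth Vader Cartoon"},
    "Alice (IT)":    {"Dilbert Cartoon", "Darth Vader Cartoon", "Disk Usage Report"},
}

# Popularity = how many users viewed each file.
popularity = Counter(f for files in views.values() for f in files)

# Tally = for each pair of files, how many users viewed both.
tally = Counter()
for files in views.values():
    for a, b in combinations(sorted(files), 2):
        tally[(a, b)] += 1

print(popularity["Dilbert Cartoon"])                      # 5
print(tally[("Darth Vader Cartoon", "Dilbert Cartoon")])  # 3
```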
28. Relationships between the files
(Graph diagram connecting the five files: Annual Report, Vision Statement, Darth Vader Cartoon, Dilbert Cartoon, Disk Usage Report.)
29. Relationships between the files
(Graph diagram with co-view tallies on the edges:)
Annual Report – Vision Statement: 2
Annual Report – Dilbert Cartoon: 2
Vision Statement – Dilbert Cartoon: 3
Vision Statement – Darth Vader Cartoon: 1
Dilbert Cartoon – Darth Vader Cartoon: 3
Dilbert Cartoon – Disk Usage Report: 1
Darth Vader Cartoon – Disk Usage Report: 1
(All other pairs: 0.)
30. Sorted relationships for each file

Annual Report: Dilbert (2), Vision Stmt. (2)
Vision Statement: Dilbert (3), Annual Rpt. (2), Darth Vader (1)
Dilbert Cartoon: Vision Stmt. (3), Darth Vader (3), Annual Rpt. (2), Disk Usage (1)
Darth Vader Cartoon: Dilbert (3), Vision Stmt. (1), Disk Usage (1)
Disk Usage Report: Dilbert (1), Darth Vader (1)

The popularity problem: notice that Dilbert appears first in every list. This is probably not what we want. The solution: divide the relationship tallies by file popularities.
31. Normalized relationships between the files
(Same graph, with the tallies normalized by file popularities:)
Annual Report – Vision Statement: .82
Annual Report – Dilbert Cartoon: .63
Vision Statement – Dilbert Cartoon: .77
Vision Statement – Darth Vader Cartoon: .33
Dilbert Cartoon – Darth Vader Cartoon: .77
Dilbert Cartoon – Disk Usage Report: .45
Darth Vader Cartoon – Disk Usage Report: .58
32. Sorted relationships for each file, normalized by file popularities

Annual Report: Vision Stmt. (.82), Dilbert (.63)
Vision Statement: Annual Rpt. (.82), Dilbert (.77), Darth Vader (.33)
Dilbert Cartoon: Darth Vader (.77), Vision Stmt. (.77), Annual Rpt. (.63), Disk Usage (.45)
Darth Vader Cartoon: Dilbert (.77), Disk Usage (.58), Vision Stmt. (.33)
Disk Usage Report: Darth Vader (.58), Dilbert (.45)

High relationship tallies AND similar popularity values now drive closeness.
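The normalization can be checked directly: dividing each tally by the square root of the product of the two files' popularities (the cosine normalization from the appendix) reproduces the scores above. A small Python sketch using the tallies and popularities from the running example:

```python
from math import sqrt

popularity = {"Annual Report": 2, "Vision Statement": 3, "Dilbert Cartoon": 5,
              "Darth Vader Cartoon": 3, "Disk Usage Report": 1}
tally = {("Annual Report", "Vision Statement"): 2,
         ("Annual Report", "Dilbert Cartoon"): 2,
         ("Vision Statement", "Dilbert Cartoon"): 3,
         ("Vision Statement", "Darth Vader Cartoon"): 1,
         ("Dilbert Cartoon", "Darth Vader Cartoon"): 3,
         ("Dilbert Cartoon", "Disk Usage Report"): 1,
         ("Darth Vader Cartoon", "Disk Usage Report"): 1}

def similarity(a, b):
    """Co-view tally divided by sqrt(popularity_a * popularity_b)."""
    t = tally.get((a, b)) or tally.get((b, a), 0)
    return t / sqrt(popularity[a] * popularity[b])

print(round(similarity("Dilbert Cartoon", "Darth Vader Cartoon"), 2))  # 0.77
print(round(similarity("Annual Report", "Vision Statement"), 2))       # 0.82
```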
33. The item-to-item CF algorithm
1) Compute file popularities
2) Compute relationship tallies and divide by file popularities
3) Sort and store the results
34. MapReduce Overview
Map → Shuffle → Reduce
(Adapted from http://code.google.com/p/mapreduce-framework/wiki/MapReduce)
35. 1. Compute File Popularities
<user, file>
  → Inverse identity map →
<file, List<user>>
  → Reduce →
<file, (user count)>
Result is a table of (file, popularity) pairs that you store in the Hadoop distributed cache.
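The popularity step can be simulated outside Hadoop with a sort-and-group standing in for the shuffle. A Python sketch (the (user, file) records here are made up; the real job is Java MapReduce):

```python
from itertools import groupby
from operator import itemgetter

# (user, file) view records, as in the slide's input.
records = [("Miranda", "Dilbert Cartoon"), ("Bob", "Dilbert Cartoon"),
           ("Susan", "Dilbert Cartoon"), ("Miranda", "Annual Report"),
           ("Bob", "Annual Report")]

# Map: invert each record to (file, user).
mapped = [(f, u) for (u, f) in records]
# Shuffle: sort so records with the same file key are adjacent.
mapped.sort(key=itemgetter(0))
# Reduce: count the users for each file.
popularity = {f: len(list(g)) for f, g in groupby(mapped, key=itemgetter(0))}
print(popularity)  # {'Annual Report': 2, 'Dilbert Cartoon': 3}
```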
39. 2b. Tally the relationship votes – just a word count, where each relationship occurrence is a word
<(file1, file2), Integer(1)>
  → Identity map →
<(file1, file2), List<Integer(1)>>
  → Reduce: count and divide by popularities →
<file1, (file2, similarity score)>, <file2, (file1, similarity score)>
Note that we emit each result twice, once for each file that belongs to a relationship.
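The tally step really is just a word count with a post-count division. A Python sketch of the reduce side, assuming the mapper output has already been shuffled, and using the example popularities (Dilbert: 5, Vader: 3):

```python
from itertools import groupby
from math import sqrt

popularity = {"Dilbert": 5, "Vader": 3}

# Mapper output: one ((file1, file2), 1) "vote" per co-view occurrence.
votes = [(("Dilbert", "Vader"), 1),
         (("Dilbert", "Vader"), 1),
         (("Dilbert", "Vader"), 1)]

results = []
for pair, group in groupby(sorted(votes), key=lambda kv: kv[0]):
    count = sum(v for _, v in group)
    score = count / sqrt(popularity[pair[0]] * popularity[pair[1]])
    # Emit twice, once keyed by each file in the relationship.
    results.append((pair[0], (pair[1], score)))
    results.append((pair[1], (pair[0], score)))
```

With three votes this yields the slide's sqrt(3/5) ≈ 0.77 for both emitted records.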
40. Example 2b: the Dilbert/Darth Vader relationship
<(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>
  → Identity map →
<(Dilbert, Vader), {1, 1, 1}>
  → Reduce: count and divide by popularities →
<Dilbert, (Vader, sqrt(3/5))>, <Vader, (Dilbert, sqrt(3/5))>
41. 3. Sort and store results
<file1, (file2, similarity score)>
  → Identity map →
<file1, List<(file2, similarity score)>>
  → Reduce →
<file1, {top n similar files}>
Store the results in your location of choice.
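The final reduce can be sketched as group-then-top-n in Python (file names and scores from the running example; TOP_N is an arbitrary choice for illustration):

```python
from collections import defaultdict

# Reducer input: (file, (other_file, similarity score)) pairs from step 2.
pairs = [("Dilbert", ("Vader", 0.77)), ("Dilbert", ("Vision", 0.77)),
         ("Dilbert", ("Annual", 0.63)), ("Dilbert", ("Disk", 0.45))]

# Group by file (the shuffle does this in a real MapReduce job).
grouped = defaultdict(list)
for f, scored in pairs:
    grouped[f].append(scored)

# Keep only the top-n most similar files per key.
TOP_N = 2
top = {f: sorted(s, key=lambda x: -x[1])[:TOP_N] for f, s in grouped.items()}
print(top["Dilbert"])  # [('Vader', 0.77), ('Vision', 0.77)]
```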
43. Appendix
• Cosine formula and normalization trick to avoid the distributed cache:
  cos θ_AB = (A · B) / (|A| |B|) = (A/|A|) · (B/|B|)
• Mahout has CF
• Asymptotic order of the algorithm is O(M·N²) in the worst case, but is helped by sparsity.
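The appendix's trick can be verified numerically: for the example's binary view vectors, the cosine computed by dividing at the end equals the dot product of pre-normalized vectors, so records can be normalized up front instead of looking popularities up in the distributed cache at reduce time. A quick Python check:

```python
from math import isclose, sqrt

# Binary view vectors over the five users (1 = viewed), from the example.
dilbert = [1, 1, 1, 1, 1]
vader   = [0, 0, 1, 1, 1]

def norm(v):
    return sqrt(sum(x * x for x in v))

dot = sum(a * b for a, b in zip(dilbert, vader))

# cos = (A·B) / (|A| |B|): divide by the norms at the end ...
cos1 = dot / (norm(dilbert) * norm(vader))
# ... equals (A/|A|)·(B/|B|): normalize first, then dot.
cos2 = sum((a / norm(dilbert)) * (b / norm(vader))
           for a, b in zip(dilbert, vader))

print(isclose(cos1, cos2), round(cos1, 2))  # True 0.77
```

For 0/1 vectors, |A|² is just the file's popularity, which is why this cosine reproduces the tally/sqrt(popularity) scores from the earlier slides.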