Identifying Pricing Request Emails Using Apache Spark and Machine Learning
Enron Email data set mining
1. Authors: Avik Das and Jagriti Das, University of Connecticut
Dr. Fei Wang
Email data mining
2. Enron email dataset: SQL tables
Enron email dataset -> Enron email dataset SQL dump -> refined SQL dump, with the noise eliminated, split into multiple views:
• Views that contain the number of messages sent across the years 2000, 2001 and 2002
• Views that contain the number of messages sent across the years 2000, 2001 and 2002 to external entities
• Views that contain the number of messages sent across the years 2000, 2001 and 2002 to lawyers
• View containing the role of each employee
Noise:
• Employees having multiple email ids
• Records of persons other than the 151 employees listed in the fact sheet
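The two kinds of noise listed above can be handled with a lookup from email address to employee id. The following is a minimal sketch under assumptions, not the authors' actual SQL: `clean_rows` and `email_to_eid` are hypothetical names, the `example.com` addresses are made up, and only the two Enron addresses and their ids (3 and 4) come from the sample rows in the deck. It applies only to employee-to-employee data, not to the external-entity views.

```python
def clean_rows(rows, email_to_eid):
    """Map (sender_mail, receiver_mail) rows to (sender_id, receiver_id).

    email_to_eid is a hypothetical lookup built from the fact sheet;
    several addresses of one employee map to the same id.
    """
    cleaned = []
    for sender_mail, receiver_mail in rows:
        sender_id = email_to_eid.get(sender_mail.strip().lower())
        receiver_id = email_to_eid.get(receiver_mail.strip().lower())
        if sender_id is None or receiver_id is None:
            continue  # not one of the fact-sheet employees
        cleaned.append((sender_id, receiver_id))
    return cleaned


email_to_eid = {
    "tracy.geaccone@enron.com": 3,
    "tracy.alias@example.com": 3,  # made-up second address of employee 3
    "teb.lokey@enron.com": 4,
}
rows = [
    ("tracy.geaccone@enron.com", "teb.lokey@enron.com"),
    ("Tracy.Alias@example.com", "teb.lokey@enron.com"),
    ("someone.else@example.com", "teb.lokey@enron.com"),  # not in fact sheet
]
cleaned = clean_rows(rows, email_to_eid)
print(cleaned)  # [(3, 4), (3, 4)]
```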
3. First slide: primary schema
Second slide: role view based on the primary schema
Third slide: sent-messages views based on the primary schema
Fourth slide: sent-messages views based on the primary schema
Fifth slide: sent-messages views based on the primary schema
Database schema
8. -for more info…List location or contact for specification (or other related documents)
Message: Mid (PK), Sender, date, message_id, subject, body, folder
send_law_2002: rvalue (PK), date, count
Contains the count of messages sent in the year 2002 to lawyers
send_law_2001: rvalue (PK), date, count
Contains the count of messages sent in the year 2001 external to lawyers
send_law_2000: rvalue (PK), date, count
Contains the count of messages sent in the year 2000 external to lawyers
9. Send Matrix
Send matrix A[i,j]
i: employee id of the sender
j: employee id of the receiver
SQL dump:
Sender ID  Sender Mail  Receiver ID  Receiver Mail
3  tracy.geaccone@enron.com  4  teb.lokey@enron.com
3  tracy.geaccone@enron.com  4  teb.lokey@enron.com
3  tracy.geaccone@enron.com  4  teb.lokey@enron.com
3  tracy.geaccone@enron.com  4  teb.lokey@enron.com
3  tracy.geaccone@enron.com  4  teb.lokey@enron.com
3  tracy.geaccone@enron.com  4  teb.lokey@enron.com
SQL dump, noise (sender and receiver are the same employee):
Sender ID  Sender Mail  Receiver ID  Receiver Mail
2  kevin.hyatt@enron.com  2  kevin.hyatt@enron.com
2  kevin.hyatt@enron.com  2  kevin.hyatt@enron.com
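A minimal sketch of how the send matrix could be built from rows like the ones above. It assumes 1-based employee ids and treats rows whose sender and receiver are the same employee as noise; `build_send_matrix` is a hypothetical helper name, not something taken from the deck.

```python
def build_send_matrix(rows, n_employees):
    """rows: iterable of (sender_id, receiver_id) pairs with 1-based ids."""
    # A[i][j] = number of messages employee i+1 sent to employee j+1
    A = [[0] * n_employees for _ in range(n_employees)]
    for sender_id, receiver_id in rows:
        if sender_id == receiver_id:
            continue  # noise: sender and receiver are the same employee
        A[sender_id - 1][receiver_id - 1] += 1
    return A


# Six rows 3 -> 4 and two self-sent rows 2 -> 2, as in the sample dump.
rows = [(3, 4)] * 6 + [(2, 2)] * 2
A = build_send_matrix(rows, n_employees=151)
print(A[2][3])  # 6  (employee 3 sent 6 messages to employee 4)
print(A[1][1])  # 0  (self-sent rows were dropped)
```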
10. Receive Matrix
Receive matrix H[i,j]
i: employee id of the receiver
j: employee id of the sender
SQL dump:
Receiver ID  Receiver Mail  Sender ID  Sender Mail
6  taylor@enron.com  4  teb.lokey@enron.com
6  taylor@enron.com  4  teb.lokey@enron.com
SQL dump, noise (receiver and sender are the same employee):
Receiver ID  Receiver Mail  Sender ID  Sender Mail
2  kevin.hyatt@enron.com  2  kevin.hyatt@enron.com
2  kevin.hyatt@enron.com  2  kevin.hyatt@enron.com
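If the receive matrix is derived from the same message rows as the send matrix A[i,j], then H[i,j] = A[j,i], so H is simply the transpose of A. The sketch below assumes exactly that; the deck does not say whether H was computed this way or queried separately, and `receive_matrix` is a hypothetical name.

```python
def receive_matrix(A):
    """Transpose a square send matrix: H[i][j] = A[j][i]."""
    return [list(column) for column in zip(*A)]


# Small 6x6 send matrix: employee 4 sent 2 messages to employee 6.
A = [[0] * 6 for _ in range(6)]
A[3][5] = 2
H = receive_matrix(A)
print(H[5][3])  # 2  (employee 6 received 2 messages from employee 4)
```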
11. Steps to find the CEO: Step 1
From the receive matrix, for each row (receiver), find the sender or senders who have sent the minimum number of mails.
For employee 2 (row 2), the minimum number is zero and is found at columns 1, 4 and 5.
Receive matrix -> new matrix C
In each row, replace all the minimum values with 999 and all other values with 0.
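Step 1 can be sketched as a row-wise transformation. The two rows below are made-up data, chosen so that the second row reproduces the example above (minimum zero at columns 1, 4 and 5); `vote_matrix` is a hypothetical name. How the diagonal (an employee's own column) should be treated is not stated in the slides, so this sketch does not special-case it.

```python
VOTE = 999  # marker value used in the slides


def vote_matrix(H):
    """Replace each row's minimum values with 999 and everything else with 0."""
    C = []
    for row in H:
        low = min(row)
        C.append([VOTE if value == low else 0 for value in row])
    return C


H = [
    [4, 0, 2, 1, 3],  # made-up row
    [0, 2, 3, 0, 0],  # minimum 0 at columns 1, 4 and 5
]
C = vote_matrix(H)
print(C[1])  # [999, 0, 0, 999, 999]
```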
12. Steps to find the CEO: Step 2
From the new matrix C, find for each employee how many times they were voted as parent.
Count the number of 999s present in each column; that gives how many times the employee was voted as parent.
Find the maximum number of 999s over all the columns.
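A minimal sketch of the column count, using a made-up 3x3 matrix C; `parent_votes` is a hypothetical name.

```python
def parent_votes(C, vote=999):
    """Count, for every column, how many rows carry the vote marker."""
    n_columns = len(C[0])
    return [sum(1 for row in C if row[j] == vote) for j in range(n_columns)]


C = [
    [999, 0, 0],
    [999, 0, 999],
    [0, 999, 999],
]
votes = parent_votes(C)
print(votes)       # [2, 1, 2]
print(max(votes))  # 2
```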
13. Steps to find the CEO: Step 3
Get the maximum number of times an employee could be voted as parent.
The maximum value comes out as 150.
New send index matrix D[i,j]
i: employee id
j: number of times the employee was voted as parent
14. Final Step for CEO
Send index matrix: several employees have the maximum value of votes as parent (noise).
Eliminating noise using the fact sheet:
Emp-ID 129: Jeffrey Skilling
Emp-ID 127: Kenneth Lay
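The slides do not spell out how the noise is eliminated. One possible reading is a cross-check of the top-voted employees against the roles in the fact sheet, sketched below. The vote counts and employees 12 and 44 are made up; only ids 127 and 129 and their names come from the deck, and `top_voted` and `keep_listed_ceos` are hypothetical names.

```python
def top_voted(votes_by_eid):
    """Return the employee ids that share the maximum vote count."""
    best = max(votes_by_eid.values())
    return sorted(eid for eid, v in votes_by_eid.items() if v == best)


def keep_listed_ceos(candidates, fact_sheet):
    """Keep only candidates whose fact-sheet role is 'CEO' (assumed rule)."""
    return [eid for eid in candidates
            if fact_sheet.get(eid, ("", ""))[1] == "CEO"]


votes_by_eid = {12: 150, 44: 97, 127: 150, 129: 150}  # made-up counts
fact_sheet = {
    127: ("Kenneth Lay", "CEO"),
    129: ("Jeffrey Skilling", "CEO"),
}
candidates = top_voted(votes_by_eid)
ceos = keep_listed_ceos(candidates, fact_sheet)
print(candidates)  # [12, 127, 129]
print(ceos)        # [127, 129]
```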
16. Levels of Hierarchy: First Children
For each employee, get the maximum number of messages sent.
For emp-1, the maximum number of messages sent was 3, so the possible first child would be 73.
Send matrix -> first child
17. Levels of Hierarchy: Second Children
For each employee, get the second maximum number of messages sent.
For emp-1, the maximum number of messages sent was 3 and the next maximum value was 2, so the possible second children would be 17 and 53.
Send matrix -> second child
18. Levels of Hierarchy: Third Children
For each employee, get the third maximum number of messages sent.
For emp-2, the third maximum number of messages sent was 44, so the possible third child would be 19.
Send matrix -> third child
19. Levels of Hierarchy: Fourth Children
For each employee, get the fourth maximum number of messages sent.
For emp-2, the fourth maximum number of messages sent was 29, so the possible fourth child would be 4.
Send matrix -> fourth child
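The first to fourth children can all be read off with one helper that picks the receivers at the k-th largest message count in an employee's row of the send matrix. This sketch assumes "k-th maximum" means the k-th largest distinct non-zero value, which matches the emp-1 example (maximum 3 gives child 73, next maximum 2 gives children 17 and 53). The row is stored as a dictionary for brevity; receiver 8 is made up and `kth_children` is a hypothetical name.

```python
def kth_children(sent_counts, k):
    """sent_counts: {receiver_id: messages sent}; k=1 is the maximum."""
    levels = sorted({c for c in sent_counts.values() if c > 0}, reverse=True)
    if k > len(levels):
        return []  # fewer than k distinct counts in this row
    target = levels[k - 1]
    return sorted(r for r, c in sent_counts.items() if c == target)


# Row of emp-1: 3 messages to employee 73, 2 each to 17 and 53.
emp1_row = {73: 3, 17: 2, 53: 2, 8: 1}
print(kth_children(emp1_row, 1))  # [73]
print(kth_children(emp1_row, 2))  # [17, 53]
```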
20. Sample Level of Hierarchy
Jeffrey Skilling (129): CEO
Kenneth Lay (127): CEO
Greg Whalley (54): President
John Arnold (44): Vice president
Jeffrey Shankman (36): President
Andy Zipper (78): Vice president
John Lavorato (53): CEO
Louise Kitchen (107): President
Barry Tycholiz (38): Vice president
21. Network Graph: Between CEOs
Nodes: Jeffrey Skilling, John Lavorato, Kenneth Lay, David Delainey
The number of inter-communications between the CEOs is quite low.
The network traffic is quite low with respect to messages sent and received.
22. Network Graph: Between managers/vice-presidents/presidents
Sample data
The number of inter-communications between mid-level employees increases as we go down from the CEO level.
23. Network Graph: Between employees
Sample data
The number of inter-communications between employees is the highest amongst all the tiers.
24. Network Graph: Intra-communication, CEO-Managers and Managers-Employees
CEO-Managers
The number of intra-communications between the CEO level and mid-level employees is quite high.
Managers-Employees
The number of intra-communications between the manager level and lower-level employees is the highest.
25. Ratio of communication between different levels
Levels: CEO, Manager, Employee
Directions shown: Manager->CEO, CEO->Manager, Manager->Employee, Employee->CEO
26. Sub send Matrices
Ratio of communication between different levels
Compute the ratio of the total number of messages sent for the different send matrices.
Send matrix -> function -> sub send matrices: CEO->Manager, Manager->CEO, Manager->Employee, Employee->Manager
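A minimal sketch of the idea, assuming each employee has already been assigned a level. It sums the send matrix per (sender level, receiver level) pair, which is equivalent to totalling each sub send matrix, and then expresses each total as a share of all counted messages. The slides do not say which ratio was actually computed, so the share is an assumption; the 3x3 matrix, `level_of`, `level_totals` and `shares` are all hypothetical.

```python
def level_totals(A, level_of):
    """Sum messages per (sender level, receiver level) pair; ids are 1-based."""
    totals = {}
    for i, row in enumerate(A, start=1):
        for j, count in enumerate(row, start=1):
            if count:
                key = (level_of[i], level_of[j])
                totals[key] = totals.get(key, 0) + count
    return totals


def shares(totals):
    """Express each pair's total as a fraction of all counted messages."""
    grand_total = sum(totals.values())
    return {key: n / grand_total for key, n in totals.items()}


level_of = {1: "CEO", 2: "Manager", 3: "Employee"}  # made-up assignment
A = [
    [0, 4, 0],  # CEO -> Manager: 4
    [2, 0, 6],  # Manager -> CEO: 2, Manager -> Employee: 6
    [0, 8, 0],  # Employee -> Manager: 8
]
totals = level_totals(A, level_of)
ratio = shares(totals)
print(totals[("Manager", "Employee")])  # 6
```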
28. Detection of Anomalous Behavior in Employees
Database -> send matrix 2000, send matrix 2001 (eid, number of messages sent)
29. Detection of Anomalous Behavior in Employees
Emails sent to lawyers + traders in 2000 vs. emails sent to lawyers + traders in 2001
Percentage change above threshold (25%)?
class learnt concept-clustering
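The threshold test can be sketched as follows. This covers only the percentage-change check, not the clustering step, and it assumes that a change in either direction counts; the slides only say "above threshold". The per-employee counts are made up, and `percent_change` and `flag_anomalies` are hypothetical names.

```python
THRESHOLD = 25.0  # percent, as stated on the slide


def percent_change(before, after):
    """Percentage change from before to after; None when before is 0."""
    if before == 0:
        return None  # undefined; such employees need separate handling
    return (after - before) / before * 100.0


def flag_anomalies(sent_2000, sent_2001, threshold=THRESHOLD):
    """Return employee ids whose change in either direction exceeds threshold."""
    flagged = []
    for eid in sorted(set(sent_2000) | set(sent_2001)):
        change = percent_change(sent_2000.get(eid, 0), sent_2001.get(eid, 0))
        if change is not None and abs(change) > threshold:
            flagged.append(eid)
    return flagged


# Made-up counts of emails sent to lawyers + traders per employee id.
sent_2000 = {1: 40, 2: 100, 3: 10}
sent_2001 = {1: 44, 2: 160, 3: 5}
flagged = flag_anomalies(sent_2000, sent_2001)
print(flagged)  # [2, 3]
```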
31. Email stats over the years 2000 and 2001 for low/mid-level employees
32. Email stats over the years 2000 and 2001 for high-level employees