Enron Email data set mining
Authors: Avik Das and Jagriti Das, University of Connecticut
Dr. Fei Wang
Email data mining
Enron email dataset: SQL tables
Enron email dataset -> Enron email dataset SQL dump -> refined SQL dump, with the noise eliminated and the data refined into multiple views:
• Views that contain the number of messages sent in the years 2000, 2001, and 2002
• Views that contain the number of messages sent in the years 2000, 2001, and 2002 to external entities
• Views that contain the number of messages sent in the years 2000, 2001, and 2002 to lawyers
• View containing the role of each employee
Noise:
• Employees having multiple email IDs
• Records of persons other than the 151 employees listed in the fact sheet
Database schema
• First slide: primary schema
• Second slide: role view based on the primary schema
• Third, fourth, and fifth slides: sent-message views based on the primary schema
Primary schema
employeelist: Email_id (PK), First Name, Second Name, Eid
Message: Mid (PK), Sender, date, message_id, subject, body, folder
recipient info: rid (PK), mid (FK), rvalue, rtype, date
The Mid in Message is present as a foreign key (mid) in recipient info.
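The three tables can be sketched as an in-memory SQLite database. This is only an illustration of the schema as listed on the slide: the column types, the lowercase table names, and the sample row are assumptions, not taken from the original SQL dump.

```python
import sqlite3

# Hypothetical reconstruction of the primary schema; column types are assumed.
SCHEMA = """
CREATE TABLE employeelist (
    email_id    TEXT PRIMARY KEY,
    first_name  TEXT,
    second_name TEXT,
    eid         INTEGER
);
CREATE TABLE message (
    mid        INTEGER PRIMARY KEY,
    sender     TEXT,
    date       TEXT,
    message_id TEXT,
    subject    TEXT,
    body       TEXT,
    folder     TEXT
);
CREATE TABLE recipientinfo (
    rid    INTEGER PRIMARY KEY,
    mid    INTEGER REFERENCES message(mid),  -- foreign key into message
    rvalue TEXT,                             -- recipient address
    rtype  TEXT,                             -- e.g. TO / CC / BCC (assumed values)
    date   TEXT
);
"""

def build_schema():
    """Create an in-memory database with foreign-key enforcement switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn

conn = build_schema()
# Illustrative row only: one message with one recipient.
conn.execute(
    "INSERT INTO message (mid, sender, date) VALUES (1, 'tracy.geaccone@enron.com', '2001-05-01')"
)
conn.execute(
    "INSERT INTO recipientinfo (rid, mid, rvalue, rtype, date) "
    "VALUES (1, 1, 'teb.lokey@enron.com', 'TO', '2001-05-01')"
)
```

With foreign keys enabled, a recipient row that points at a missing message is rejected, which is the relationship the slide describes.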
Role view
employeelist: Email_id (PK), First Name, Second Name, Eid
Role: Eid (PK), First Name (FK), Second Name (FK), Role
The role view maps the first and last name for each employee ID from employeelist.
The role view maps the role from the fact sheet.
Sent-message views
Message: Mid (PK), Sender, date, message_id, subject, body, folder
send2002: rvalue (PK), date, count. Contains the count of messages sent in the year 2002.
send2001: rvalue (PK), date, count. Contains the count of messages sent in the year 2001.
send2000: rvalue (PK), date, count. Contains the count of messages sent in the year 2000.
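A per-year view of this kind might be defined as in the sketch below. The slides do not show the view definitions, so the grouping (one row per recipient address `rvalue`), the join on `mid`, the ISO date strings, and the helper name `make_send_view` are all assumptions made for illustration.

```python
import sqlite3

def make_send_view(conn, year):
    """Create a view send<year> with one row per recipient address.

    Assumption: messages are counted per rvalue and filtered by the year of
    message.date, which is stored as an ISO 'YYYY-MM-DD' string.
    """
    year = int(year)  # interpolated into SQL below, so force a plain integer
    conn.execute(f"""
        CREATE VIEW send{year} AS
        SELECT r.rvalue AS rvalue,
               '{year}' AS date,
               COUNT(*) AS count
        FROM message AS m
        JOIN recipientinfo AS r ON r.mid = m.mid
        WHERE strftime('%Y', m.date) = '{year}'
        GROUP BY r.rvalue
    """)

# Tiny illustrative dataset (not from the Enron dump).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE message (mid INTEGER PRIMARY KEY, sender TEXT, date TEXT);
    CREATE TABLE recipientinfo (rid INTEGER PRIMARY KEY, mid INTEGER, rvalue TEXT);
    INSERT INTO message VALUES (1, 'a@enron.com', '2002-01-15'),
                               (2, 'a@enron.com', '2002-02-20'),
                               (3, 'a@enron.com', '2001-07-04');
    INSERT INTO recipientinfo VALUES (1, 1, 'b@enron.com'),
                                     (2, 2, 'b@enron.com'),
                                     (3, 3, 'b@enron.com');
""")
make_send_view(conn, 2002)
rows_2002 = conn.execute("SELECT rvalue, count FROM send2002").fetchall()
```

Views restricted to external recipients or to lawyers would presumably add a further filter on `rvalue`; that filter is not specified in the slides and is left out here.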
Sent-message views: external recipients
Message: Mid (PK), Sender, date, message_id, subject, body, folder
send_ext_2002: rvalue (PK), date, count. Contains the count of messages sent in the year 2002 to recipients external to Enron.
send_ext_2001: rvalue (PK), date, count. Contains the count of messages sent in the year 2001 to recipients external to Enron.
send_ext_2000: rvalue (PK), date, count. Contains the count of messages sent in the year 2000 to recipients external to Enron.
Sent-message views: lawyers
Message: Mid (PK), Sender, date, message_id, subject, body, folder
send_law_2002: rvalue (PK), date, count. Contains the count of messages sent in the year 2002 to lawyers.
send_law_2001: rvalue (PK), date, count. Contains the count of messages sent in the year 2001 to lawyers.
send_law_2000: rvalue (PK), date, count. Contains the count of messages sent in the year 2000 to lawyers.
Send Matrix
Send matrix A[i,j]:
i: employee ID of the sender
j: employee ID of the receiver

SQL dump
Sender ID  Sender Mail               Receiver ID  Receiver Mail
3          tracy.geaccone@enron.com  4            teb.lokey@enron.com
3          tracy.geaccone@enron.com  4            teb.lokey@enron.com
3          tracy.geaccone@enron.com  4            teb.lokey@enron.com
3          tracy.geaccone@enron.com  4            teb.lokey@enron.com
3          tracy.geaccone@enron.com  4            teb.lokey@enron.com
3          tracy.geaccone@enron.com  4            teb.lokey@enron.com

SQL dump: noise
Sender ID  Sender Mail            Receiver ID  Receiver Mail
2          kevin.hyatt@enron.com  2            kevin.hyatt@enron.com
2          kevin.hyatt@enron.com  2            kevin.hyatt@enron.com
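One way to build the send matrix from such rows is sketched below in plain Python. It assumes that the noise rows are the self-addressed ones (sender and receiver IDs equal) and that addresses outside the employee list are dropped; both filters, the function names, and the extra outsider address are illustrative rather than taken from the slides. The receive matrix is obtained here as the transpose of the send matrix.

```python
N_EMPLOYEES = 151  # size of the employee list, per the fact sheet

def build_send_matrix(rows, email_to_eid, n_employees=N_EMPLOYEES):
    """Return A where A[i][j] is the number of messages employee i sent to j.

    Row and column 0 are unused so that employee IDs index the matrix directly.
    """
    A = [[0] * (n_employees + 1) for _ in range(n_employees + 1)]
    for sender, receiver in rows:
        i = email_to_eid.get(sender)
        j = email_to_eid.get(receiver)
        if i is None or j is None:
            continue  # address is not on the employee list (assumed noise)
        if i == j:
            continue  # self-addressed row (assumed noise)
        A[i][j] += 1
    return A

def transpose(matrix):
    """Receive matrix H[i][j] = A[j][i]: rows are receivers, columns senders."""
    return [list(column) for column in zip(*matrix)]

# Sample rows mirroring the slide, plus one hypothetical outside address.
email_to_eid = {
    "kevin.hyatt@enron.com": 2,
    "tracy.geaccone@enron.com": 3,
    "teb.lokey@enron.com": 4,
}
rows = (
    [("tracy.geaccone@enron.com", "teb.lokey@enron.com")] * 6
    + [("kevin.hyatt@enron.com", "kevin.hyatt@enron.com")] * 2
    + [("outsider@example.com", "teb.lokey@enron.com")]
)
A = build_send_matrix(rows, email_to_eid)
H = transpose(A)
```

Here `A[3][4]` ends up as 6 and the self-addressed rows for employee 2 contribute nothing.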
Receive Matrix
Receive matrix H[i,j]:
i: employee ID of the receiver
j: employee ID of the sender

SQL dump
Receiver ID  Receiver Mail     Sender ID  Sender Mail
6            taylor@enron.com  4          teb.lokey@enron.com
6            taylor@enron.com  4          teb.lokey@enron.com

SQL dump: noise
Receiver ID  Receiver Mail          Sender ID  Sender Mail
2            kevin.hyatt@enron.com  2          kevin.hyatt@enron.com
2            kevin.hyatt@enron.com  2          kevin.hyatt@enron.com
Steps to find the CEO: Step 1
From the receive matrix, for each row (receiver), find the sender or senders who have sent the minimum number of mails.
Example: for employee 2 (row 2), the minimum is zero and is found at columns 1, 4, and 5.
Receive matrix -> new matrix C: within each row, replace all the minimum values with 999 and all other values with 0.
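Step 1 can be sketched as follows. The 5x5 receive matrix is made up for illustration (0-based indices, so the second row stands in for "employee 2"); only the 999 marker and the replace-minima rule come from the slide.

```python
VOTE = 999  # marker value used on the slide for a row minimum

def mark_row_minima(H):
    """Build C from receive matrix H: row minima become 999, the rest 0."""
    C = []
    for row in H:
        low = min(row)
        C.append([VOTE if value == low else 0 for value in row])
    return C

# Hypothetical receive matrix; rows are receivers, columns are senders.
H = [
    [4, 2, 5, 1, 3],
    [0, 7, 3, 0, 0],  # minimum 0 at the 1st, 4th, and 5th columns
    [6, 1, 2, 8, 1],
    [3, 3, 9, 2, 5],
    [2, 0, 4, 6, 1],
]
C = mark_row_minima(H)
```

Ties are all marked, so a row can cast several votes, as in the slide's example for employee 2.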
Steps to find the CEO: Step 2
From the new matrix C, find for each employee how many times they were voted as parent.
• The number of 999s present in a column gives how many times that employee was voted as parent.
• Find the maximum number of 999s over all the columns.
Steps to find the CEO: Step 3
Get the maximum number of times an employee could be voted as parent. The maximum value comes out to 150.
New send index matrix D[i,j]:
i: employee ID
j: number of times the employee was voted as parent
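Counting the votes per column and collecting the top-voted employees might look like the sketch below. The matrix `C` is a small hypothetical example, the 1-based employee IDs are an assumption, and the function names are illustrative.

```python
VOTE = 999

def count_parent_votes(C, vote=VOTE):
    """Number of 999s in each column of C, i.e. votes received per employee."""
    n_cols = len(C[0])
    return [sum(1 for row in C if row[j] == vote) for j in range(n_cols)]

def send_index(C):
    """Return (D, top, candidates).

    D is a list of (employee id, votes) pairs with ids starting at 1,
    top is the largest vote count, candidates are the ids that reach it.
    """
    votes = count_parent_votes(C)
    D = list(enumerate(votes, start=1))
    top = max(votes)
    candidates = [eid for eid, v in D if v == top]
    return D, top, candidates

# Hypothetical marked matrix, as produced by replacing row minima with 999.
C = [
    [0, 0, 0, 999, 0],
    [999, 0, 0, 999, 999],
    [0, 999, 0, 0, 999],
    [0, 0, 0, 999, 0],
    [0, 999, 0, 0, 0],
]
D, top, candidates = send_index(C)
```

When several employees tie for the top count, `candidates` contains all of them, which matches the situation the final step has to resolve.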
Final Step for CEO
Several employees have the maximum value for the number of times voted as parent (noise).
Send index matrix + fact sheet -> eliminating noise:
Emp-ID 129: Jeffrey Skilling
Emp-ID 127: Kenneth Lay
Proposed hierarchy from the CEO-finding algorithm
Jeffrey Skilling (President & CEO)
  Unknown
  Unknown
Levels of Hierarchy: First Children
For each employee, get the maximum number of messages sent.
Example: for emp-1, the maximum number of messages sent was 3, so the possible first child would be 73.
Send matrix -> first child
Levels of Hierarchy: Second Children
For each employee, get the second maximum number of messages sent.
Example: for emp-1, the maximum number of messages sent was 3 and the next maximum value was 2, so the possible second children would be 17 and 53.
Send matrix -> second child
Levels of Hierarchy: Third Children
For each employee, get the third maximum number of messages sent.
Example: for emp-2, the third maximum number of messages sent was 44, so the possible third child would be 19.
Send matrix -> third child
Levels of Hierarchy: Fourth Children
For each employee, get the fourth maximum number of messages sent.
Example: for emp-2, the fourth maximum number of messages sent was 29, so the possible fourth child would be 4.
Send matrix -> fourth child
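The first-to-fourth children steps all follow one pattern: take the k-th largest entry in an employee's row of the send matrix and return the receivers that hold it. The sketch below assumes that "k-th maximum" means the k-th largest distinct non-zero value and that the sender's own column is ignored; the row for emp-1 reuses the numbers from the examples above, with one extra hypothetical entry.

```python
def kth_children(A, sender, k):
    """Return (value, receivers) for the k-th largest distinct count in row `sender`.

    Assumptions: zero counts and the sender's own column are ignored.
    Returns (None, []) when the row has fewer than k distinct counts.
    """
    row = A[sender]
    distinct = sorted(
        {v for j, v in enumerate(row) if j != sender and v > 0},
        reverse=True,
    )
    if k < 1 or k > len(distinct):
        return None, []
    target = distinct[k - 1]
    receivers = [j for j, v in enumerate(row) if j != sender and v == target]
    return target, receivers

# Send matrix indexed by employee ID (row/column 0 unused).
n = 151
A = [[0] * (n + 1) for _ in range(n + 1)]
A[1][73] = 3   # emp-1 sent most messages to employee 73
A[1][17] = 2   # tie for the second maximum
A[1][53] = 2
A[1][20] = 1   # hypothetical extra entry

first = kth_children(A, 1, 1)
second = kth_children(A, 1, 2)
```

Treating ties as a group is a design choice here: it is what lets emp-1 have two second children in the example.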
Sample Level of Hierarchy
Name              Emp-ID  Role
Jeffrey Skilling  129     CEO
Kenneth Lay       127     CEO
Greg Whalley      54      President
John Arnold       44      Vice president
Jeffrey Shankman  36      President
Andy Zipper       78      Vice president
John Lavorato     53      CEO
Louise Kitchen    107     President
Barry Tycholiz    38      Vice president
Network Graph: Between CEOs
Nodes: Jeffrey Skilling, John Lavorato, Kenneth Lay, David Delainey
• The number of inter-communications between CEOs is quite low.
• The network traffic is quite low with respect to messages sent and received.
Network Graph: Between managers/vice-presidents/presidents
Sample data
• The number of inter-communications between mid-level employees increases as we go down from the CEO level.
Network Graph: Between employees
Sample data
• The number of inter-communications between employees is the highest among all the tiers.
Network Graph: Intra-communication, CEO-Managers and Employee-Managers
CEO-Managers
• The number of intra-communications between the CEO level and mid-level employees is quite high.
Managers-Employees
• The number of intra-communications between the manager level and lower-level employees is the highest.
Ratio of communication between different levels
Levels: CEO, Manager, Employee
Directions: CEO->Manager, Manager->CEO, Manager->Employee, Employee->CEO
Sub send matrices
Compute the ratio of the total number of messages sent for the different send matrices.
Ratio of communication between different levels
Send matrix -> function -> sub send matrices: CEO->Manager, Manager->CEO, Manager->Employee, Employee->Manager
Output
Employee-sent Manager-response/sent CEO-response
4 2
4 1
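The sub-matrix totals and their ratios can be sketched as below. The level assignment, the message counts, and the choice to normalise by the total cross-level traffic are all hypothetical; the slides do not state how the ratio is normalised.

```python
def block_total(A, senders, receivers):
    """Total messages sent from any id in `senders` to any id in `receivers`."""
    return sum(A[i][j] for i in senders for j in receivers if i != j)

def level_ratios(A, levels):
    """Return (totals, ratios) for every ordered pair of distinct levels.

    Assumption: each ratio is that direction's share of all cross-level traffic.
    """
    totals = {}
    for src, src_ids in levels.items():
        for dst, dst_ids in levels.items():
            if src != dst:
                totals[f"{src}->{dst}"] = block_total(A, src_ids, dst_ids)
    grand = sum(totals.values())
    ratios = {key: (value / grand if grand else 0.0) for key, value in totals.items()}
    return totals, ratios

# Hypothetical 5-employee send matrix (row/column 0 unused).
A = [[0] * 6 for _ in range(6)]
A[1][2] = 2   # CEO -> Manager
A[2][1] = 4   # Manager -> CEO
A[2][4] = 6   # Manager -> Employee
A[4][2] = 8   # Employee -> Manager
A[4][5] = 9   # Employee -> Employee, ignored by the cross-level totals

levels = {"CEO": [1], "Manager": [2, 3], "Employee": [4, 5]}
totals, ratios = level_ratios(A, levels)
```

Traffic within a level is excluded on purpose, since the slide only names the directions between levels.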
Detection of Anomalous Behavior in Employees
Database -> send matrix 2000 and send matrix 2001 -> for each eid, the number of messages sent
For each of 2000 and 2001: emails sent to lawyers + traders
Is the percentage change above the threshold (25%)? -> class learnt (concept clustering)
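The threshold test in this flow can be sketched as a simple year-over-year comparison. This covers only the 25% check, not the concept-clustering step; comparing the absolute value of the change, labelling employees with no 2000 baseline as "undefined", and the sample counts are all assumptions.

```python
THRESHOLD = 25.0  # percent, from the slide

def percent_change(before, after):
    """Percentage change from `before` to `after`; None when there is no baseline."""
    if before == 0:
        return None
    return (after - before) / before * 100.0

def flag_employees(sent_2000, sent_2001, threshold=THRESHOLD):
    """Label each eid 'above', 'below', or 'undefined' relative to the threshold.

    Assumption: the magnitude of the change is compared, so large drops are
    flagged as well as large increases.
    """
    labels = {}
    for eid in sorted(set(sent_2000) | set(sent_2001)):
        change = percent_change(sent_2000.get(eid, 0), sent_2001.get(eid, 0))
        if change is None:
            labels[eid] = "undefined"
        elif abs(change) > threshold:
            labels[eid] = "above"
        else:
            labels[eid] = "below"
    return labels

# Hypothetical counts of emails sent to lawyers + traders, keyed by eid.
sent_2000 = {1: 40, 2: 10, 3: 0}
sent_2001 = {1: 44, 2: 20, 3: 5}
labels = flag_employees(sent_2000, sent_2001)
```

In this made-up data, employee 1 changes by 10% and stays below the threshold, while employee 2 doubles and lands above it.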
Detection of Anomalous Behavior in Employees: Results, sample clusters
Below-threshold samples
Above-threshold samples
CEO/President
Email stats over the years 2000 and 2001 for low/mid-level employees
Email stats over the years 2000 and 2001 for high-level employees
• Higher-level employees
Temporal analysis of emails sent for some high-level employees
Future work
• Semantic analysis using the LIWC tool.
• Probabilistic dependency.
Thank You
Questions?
