Classification Analysis of Spam E-mail Data Using a Machine Learning Approach


This document summarizes a classification analysis of spam e-mail data using a machine learning approach. The study uses the UCI Spambase dataset, which contains 4601 rows and 58 columns. The classification method is the Gradient Boosting Classifier, which classifies the data with 93.5% accuracy. The model labels 2875 e-mails as non-spam and 1726 as spam.


By: Anadia Rahmat Syihab Hidayatullah
06211540000001

OUTLINE
1. Introduction
2. Research Methodology
3. Analysis and Discussion

Background

Spam is the use of electronic devices to send messages repeatedly without the recipient's consent (Yuda, 2016).

In the UCI Spambase data, "spam" covers advertisements for products or websites, get-rich-quick schemes, chain letters, pornography, and similar content (UCI Repository).

[Impact]
- Inbox storage becomes cluttered with spam content
- Places a heavy load on e-mail message storage

[Solution]

Data Source
- Secondary data
- Titled "Spambase Dataset"
- Contains:
  - 4601 rows
  - 58 columns

DATA STRUCTURE

The data consist of word-frequency variables on a ratio scale that were already preprocessed into percentages. The structure of the data is as follows:

| No.  | Word_freq_make | Word_freq_address | Word_freq_all | ... | crl_average | Class |
|------|----------------|-------------------|---------------|-----|-------------|-------|
| 1    | 0              | 0.64              | 0.64          | ... | 61          | 1     |
| 2    | 0.21           | 0.28              | 0.5           | ... | 101         | 1     |
| 3    | 0.06           | 0                 | 0.71          | ... | 485         | 1     |
| ...  | ...            | ...               | ...           | ... | ...         | ...   |
| 4601 | 0              | 0                 | 0.65          | ... | 5           | 0     |

Y is the class (e-mail type):
- Spam is labeled 1
- Non-spam is labeled 0

Research Variables

Variables:
- Independent variables (X): 57 variables
- Dependent variable (Y): 1 variable

| No  | Attribute                     | Data type | Variable |
|-----|-------------------------------|-----------|----------|
| 1   | %Word_freq_make               | Ratio     | X        |
| 2   | %Word_freq_address            | Ratio     | X        |
| 3   | %Word_freq_all                | Ratio     | X        |
| 4   | %Word_freq_3d                 | Ratio     | X        |
| 5   | %Word_freq_our                | Ratio     | X        |
| 6   | %Word_freq_remove             | Ratio     | X        |
| 7   | %Word_freq_over               | Ratio     | X        |
| ... | ...                           | ...       | ...      |
| 57  | crl_average                   | Ratio     | X        |
| 58  | Y (Spam = 1 and Non-spam = 0) | Nominal   | Y        |
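To make the X/Y split in the variable table concrete, here is a minimal sketch that separates each row into 57 feature values and one class label. It assumes a comma-separated file with no header row and the label in the last column. The helper name `split_features_label` and the two sample rows are hypothetical and stand in for the real data.

```python
import csv
import io

N_FEATURES = 57  # number of X columns, as stated in the variable table


def split_features_label(rows):
    """Split parsed rows into feature vectors X and class labels y."""
    X, y = [], []
    for row in rows:
        if len(row) != N_FEATURES + 1:
            raise ValueError(f"expected {N_FEATURES + 1} columns, got {len(row)}")
        X.append([float(v) for v in row[:N_FEATURES]])
        y.append(int(float(row[N_FEATURES])))  # last column: 1 = spam, 0 = non-spam
    return X, y


# Hypothetical two-row sample standing in for the real file (assumption).
sample = "\n".join([
    ",".join(["0.0"] * N_FEATURES + ["1"]),
    ",".join(["0.5"] * N_FEATURES + ["0"]),
])
X, y = split_features_label(csv.reader(io.StringIO(sample)))
print(len(X), len(X[0]), y)
```

With the real file, the same function could be fed `csv.reader(open(path))`, where `path` points to wherever the dataset was saved.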

Analysis steps:
1. Import Libraries
2. Import Dataset
3. Data Exploration
4. Data Preprocessing
5. Feature Selection
6. Classification Analysis
7. Conclusions and Recommendations

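Preprocessing: Missing Values

| No  | Variable           | Missing values |
|-----|--------------------|----------------|
| 1   | %Word_freq_make    | 0              |
| 2   | %Word_freq_address | 0              |
| 3   | %Word_freq_all     | 0              |
| 4   | %Word_freq_3d      | 0              |
| 5   | %Word_freq_our     | 0              |
| 6   | %Word_freq_remove  | 0              |
| 7   | %Word_freq_over    | 0              |
| ... | ...                | ...            |
| 57  | crl_average        | 0              |
| 58  | Y                  | 0              |

There are no missing values in the data.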
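A per-column missing-value count like the one tabulated above can be sketched as follows. This is an illustrative version only: the helper `count_missing`, the three column names, and the sample rows are hypothetical, and missing entries are assumed to appear as `None` or NaN.

```python
import math


def count_missing(rows, columns):
    """Return {column: number of missing entries} for a list of dict rows."""
    def is_missing(value):
        # Treat None and float NaN as missing (assumed encoding).
        return value is None or (isinstance(value, float) and math.isnan(value))

    return {col: sum(is_missing(row.get(col)) for row in rows) for col in columns}


# Hypothetical sample rows with only three of the 58 columns.
columns = ["word_freq_make", "word_freq_address", "Y"]
rows = [
    {"word_freq_make": 0.0, "word_freq_address": 0.64, "Y": 1},
    {"word_freq_make": 0.21, "word_freq_address": 0.28, "Y": 1},
    {"word_freq_make": 0.0, "word_freq_address": 0.0, "Y": 0},
]
missing = count_missing(rows, columns)
print(missing)
```

A count of 0 for every column corresponds to the "no missing values" result reported on the slide.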

Preprocessing: Boxplot
The boxplots show that the data contain outliers.

Preprocessing: IQR
Observations are flagged using the interquartile range (IQR):
- False = not an outlier
- True = outlier
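The True/False flags can be reproduced with a standard IQR rule. The sketch below is one common formulation, not necessarily the one used in the slides: it assumes the usual 1.5 x IQR fences (the multiplier is not stated in the source), and the helper `iqr_outlier_flags` and the sample values are hypothetical.

```python
import statistics


def iqr_outlier_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumption."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


# Hypothetical values for a single run-length style column.
sample_values = [5, 61, 101, 9, 12, 485, 7, 15]
flags = iqr_outlier_flags(sample_values)
print(list(zip(sample_values, flags)))
```

In this sample only the value 485 falls outside the fences, so it is the only entry flagged True.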

ExtraTreesClassifier
Feature selection reduces the number of variables from 57 to the 20 that contribute significantly, using ExtraTreesClassifier.
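A feature-selection step of this kind can be sketched with scikit-learn's `ExtraTreesClassifier` and its `feature_importances_` attribute. The slides do not say how the 20 variables were chosen, so keeping the 20 highest importances is an assumption here, as are the hyperparameters, the helper `top_k_features`, and the synthetic data used in place of Spambase.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier


def top_k_features(X, y, k=20, seed=0):
    """Rank features by ExtraTrees importance and return the top-k column indices."""
    model = ExtraTreesClassifier(n_estimators=100, random_state=seed)
    model.fit(X, y)
    order = np.argsort(model.feature_importances_)[::-1]  # most important first
    return order[:k], model.feature_importances_


# Synthetic stand-in data: 200 rows, 57 features (not the real dataset).
rng = np.random.default_rng(0)
X = rng.random((200, 57))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

top_idx, importances = top_k_features(X, y, k=20)
X_reduced = X[:, top_idx]  # keep only the selected columns
print(X_reduced.shape)
```

The reduced matrix `X_reduced` would then be passed to the classifiers in place of the full 57-column matrix.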

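Classification Analysis

| No | Classification method  | Accuracy |
|----|------------------------|----------|
| 1  | CART                   | 0.867    |
| 2  | k-Nearest Neighbour    | 0.793    |
| 3  | Naive Bayes            | 0.816    |
| 4  | Support Vector Machine | 0.871    |
| 5  | Random Forest          | 0.902    |
| 6  | Bagging                | 0.890    |
| 7  | Adaptive Boosting      | 0.901    |
| 8  | Gradient Boosting      | 0.906    |
| 9  | Logistic Regression    | 0.881    |
| 10 | Neural Network         | 0.880    |

Confusion matrix for Gradient Boosting (row totals are the predicted counts; the column sums give the actual counts, 2788 non-spam and 1813 spam):

|          | Non-spam | Spam | Total |
|----------|----------|------|-------|
| Non-spam | 2684     | 191  | 2875  |
| Spam     | 104      | 1622 | 1726  |

Gradient Boosting
- Accuracy: 93.5%
- Precision: 93.6%
- Recall: 92.8%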
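The reported accuracy, precision, and recall can be checked against the confusion matrix. The sketch below assumes that rows are predicted classes and columns are actual classes, and that precision and recall are macro-averaged over the two classes. Both are inferences from the numbers and are not stated in the slides.

```python
# Confusion matrix from the slide: outer key = predicted class,
# inner key = actual class (assumed orientation).
confusion = {
    "non_spam": {"non_spam": 2684, "spam": 191},
    "spam": {"non_spam": 104, "spam": 1622},
}
classes = list(confusion)
total = sum(sum(row.values()) for row in confusion.values())

# Accuracy: correctly classified e-mails over all e-mails.
accuracy = sum(confusion[c][c] for c in classes) / total


def precision(c):
    """Correct predictions of class c over all predictions of class c."""
    return confusion[c][c] / sum(confusion[c].values())


def recall(c):
    """Correct predictions of class c over all actual members of class c."""
    return confusion[c][c] / sum(confusion[p][c] for p in classes)


# Macro average: unweighted mean over both classes (assumption).
macro_precision = sum(precision(c) for c in classes) / len(classes)
macro_recall = sum(recall(c) for c in classes) / len(classes)

print(f"accuracy={accuracy:.4f}")
print(f"macro precision={macro_precision:.4f}")
print(f"macro recall={macro_recall:.4f}")
```

Under these assumptions the values come out at roughly 0.9359, 0.9367, and 0.9287, which agree with the reported 93.5%, 93.6%, and 92.8% if the slide truncates to one decimal place rather than rounding.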

Conclusions and Recommendations

Conclusions
- Of the 57 variables (features), only 20 are significant for building the classification model.
- The best of the 10 methods tried is the Gradient Boosting Classifier, with accuracy, precision, and recall of 93.5%, 93.6%, and 92.8%, respectively.
- The classification yields predictions close to the actual labels: 2875 non-spam and 1726 spam predicted, versus 2788 non-spam and 1813 spam in the actual Y data.

Recommendations
- Future work on this final project should try more methods, both in data preprocessing and, in particular, in the analysis or classification.



