Drone Emprit
Concepts and Technology
Ismail Fahmi, PhD.
Drone Emprit
Media Kernels Indonesia
Ismail.fahmi@gmail.com
IT CAMP – BIG DATA & DATA MINING
Onno Center, Situ Gintung - Jakarta
1 October 2017
1992 – 1997 Bachelor's degree (S1), Electrical Engineering, ITB
2003 – 2004 Master's degree (S2), Computational Linguistics, University of Groningen, the Netherlands
2004 – 2009 PhD (S3), Computational Linguistics, University of Groningen, the Netherlands
2000 – 2003 Initiator of IndonesiaDLN (the first Digital Library Network in Indonesia)
Developed the Ganesha Digital Library (GDL)
Founded the Knowledge Management Research Group (KMRG) at ITB
Built the ITB Digital Library
2009 – present Engineer at Weborama, a big data company (Paris/Amsterdam)
2012 – present Co-Founder of Awesometrics, a media monitoring and analytics company
2014 – present Founder of PT. Media Kernels Indonesia, a natural language processing company
2015 – present Consultant to the National Library (Perpustakaan Nasional), initiator of Indonesia OneSearch
2017 – present Permanent lecturer, Master's program in Informatics Engineering (Magister Teknik Informatika), Universitas Islam Indonesia
Agenda
SESSION 1
• Concepts
• About Drone Emprit
• Data, the new gold mine
• Architecture & Features
• Technology
• Crawler
• Twitter
• Facebook
• Online News
• Indexing
• Sharding
• Replication
• Analytics
• Sentiment Analysis
• Opinion Analysis
• Term Extraction
• Clustering
• Social Network Analysis
• Visualization
SESSION 2
• Case Studies
• Analysis of the West Java regional election (Pilkada Jawa Barat)
• Analysis of the pro and contra debate on the PKI
• Reading the media's agenda setting
• Demo
• Creating a new monitoring topic
• Reading the analysis results
• Editing sentiment
• Social Network Analysis
About Drone Emprit
Media Kernels a.k.a. Drone Emprit
• A system for monitoring and analyzing online and social media, built on big data technology.
• Developed since 2009 in Amsterdam, the Netherlands, by Indonesians, through Media Kernels Netherlands B.V.
• In use in Indonesia since 2012.
• Based on Artificial Intelligence (Machine Learning) and Natural Language Processing (NLP).
• Known as 'Drone Emprit' in coverage on TV and in national media.
Drone Emprit
2-8 January 2017
TEMPO
Topic: hoax farms on social media
Media Kernels:
• Reported under the name 'Drone Emprit'.
• Presented a Social Network Analysis (SNA) map showing where a hoax originates, how it spreads, who the main influencers are, and which groups they belong to.
• Issues analyzed included the 10 million Chinese workers and Aleppo (ISIS).
TEMPO COVER STORY, 2-8 January 2017
12 January 2017
PRESIDENTIAL STAFF OFFICE (KANTOR STAF PRESIDEN)
Case: a hoax attacking the government about 10 million illegal Chinese workers.
Media Kernels:
• Presented two case studies: the 10 million illegal Chinese workers, and negative sentiment toward the anti-hoax movement.
• Showed the timeline of how the issue resonated, and a conversation map built with the SNA feature.
• Showed that government communication was not effective enough, and what could be done to improve it.
FOCUS GROUP DISCUSSION (FGD) FOR PUBLIC RELATIONS STAFF OF ALL MINISTRIES AND AGENCIES AT THE PRESIDENTIAL STAFF OFFICE (KSP)
22 March 2017
MATA NAJWA
Case: Virus Dusta (the 'virus of lies', i.e. hoaxes)
Speakers:
• Stanley (Press Council, Dewan Pers)
• Johan Budi (Presidential Special Staff)
• Boy Rafli (Polri Public Relations)
• Ismail Fahmi (MK)
• Septiaji & Khairul Anshar (Masyarakat Anti Hoax)
Media Kernels:
• Presented an analysis of the 10 million illegal Chinese workers issue.
• The TNI Commander vs. PKI hoax.
MATA NAJWA LIVE 'VIRUS DUSTA'
Data is New Gold
6 May 2017
Data Collection: Gold = Expensive
Free Data
Twitter Analysis: World Economic Forum 2016
https://medium.com/@swainjo/wef16-davos-twitter-sna-analysis-4c38cf4bc46d
Architecture
MK Big Data Architecture
[Diagram: MK Big Data Architecture. Labels shown: News Crawler, Twitter Crawler, Twitter Streaming, FB Page Crawler, Google Custom Search, Other sources; Data Pipeline; Data Ingest; Management & Queue; Data; SOLR Indexer 1, SOLR Indexer 2, SOLR Indexer 3, SOLR Indexer 4; Database Framework; Hadoop Framework; Physical Hardware; Realtime Job Processing; Scheduled Job Processing; Map Reduce; Sentiment Analysis; Other Processings; Data & Workflow Management; Insight; Access; Visualization; Analytics UI]
System Architecture
[Diagram: System Architecture. Labels shown:
Sources: Social Media (Twitter, Facebook, Twitter Stream); Online News (Detik (ID), Reuters (EN), Gatra (ID), Bloomberg (EN), etc.); Forums (Kaskus, Detik Forum, etc.); Print (Kompas, Warta Ekonomi, etc.); Subscriber
Formats: Search + JSON, RSS + HTML, HTML, JSON, TEXT, PUSH JSON
Crawlers: Search + Account Crawler, RSS + HTML Crawler, HTML Crawler, Converter
Index Servers: SOLR Nodes, Shard 1 to Shard N
Processing: Redis Queue, Cache Manager, Keywords + Accounts Filters, Backtrack Filters, deletes, Sentiment Analysis, Sentiment Models, Analyses
Storage: Projects Storage, Mentions Storage
Client(s): Control Room Screens, smartphones, tablets, desktops]
Media Kernels Features
Trends
DASHBOARD
Comparison
Topic Map
NEWS PORTAL
Latest News
Media
ANALYTICS
News Sites
Page Ranks
Sentiment Analysis
PF-Chart
Engagement
Exposure
Retweets
TOPICS
Replies
Most Shared URLs
Most Shared Videos
Topic Map
Word Cloud
Impact
INFLUENCERS
Engagement
Reach
Most Engaged
Followers
Influencer Network
SNA
Topic Network
PR-Values
Reach
Hashtags Posts
Bubble Map
Twitter User Map
DEMOGRAPHY
User Locations
Edit Sentiments
MENTIONS
Training & Learning
Backtracking
Compare SNA
COMPARE
Compare Projects
Popularity vs
Favorability
Background Jobs
Upload Report
REPORTING
Download Report
User Management
ADMIN
Project Management
Client Management
Source Management
Label and Training
OPINION ANALYSIS
Opinion Chart
Insight Explorer
News Crawler
Online News
And hundreds of non-mainstream media outlets
Crawling Online News
Crawler, Index Server
Web Crawler Tools
http://bigdata-madesimple.com/top-50-open-source-web-crawlers-for-data-mining/
Web Crawler Tools (2)
http://bigdata-madesimple.com/top-50-open-source-web-crawlers-for-data-mining/
Example: Scrapy.org
Web Crawler Drone Emprit
Built in-house, powered by:
Anatomy: Metadata and Full Text
Keep:
Date, title, article body, author, image URL
Discard:
Ads, headline lists, comments.
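The keep/discard rule above can be sketched in a few lines of Python. This is only an illustration, not the Drone Emprit crawler: the class names `date`, `author`, `ads`, `headline-list`, and `comments` are hypothetical, and every real news site needs its own extraction rules.

```python
from html.parser import HTMLParser

# Hypothetical class names; each real site needs its own rules.
SKIP_CLASSES = {"ads", "headline-list", "comments"}
VOID_TAGS = {"br", "img", "meta", "link", "hr", "input"}


class ArticleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = self.date = self.author = self.image_url = ""
        self.paragraphs = []
        self._skip_depth = 0     # > 0 while inside a discarded block
        self._field = None       # field currently receiving text
        self._field_tag = None   # tag that opened that field

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if tag in VOID_TAGS:
            # Keep the first image that is not inside a discarded block.
            if tag == "img" and not self._skip_depth and not self.image_url:
                self.image_url = attrs.get("src") or ""
            return
        if self._skip_depth or classes & SKIP_CLASSES:
            self._skip_depth += 1
            return
        field = None
        if tag == "h1":
            field = "title"
        elif "date" in classes:
            field = "date"
        elif "author" in classes:
            field = "author"
        elif tag == "p":
            field = "body"
            self.paragraphs.append("")
        if field:
            self._field, self._field_tag = field, tag

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1
        elif tag == self._field_tag:
            self._field = self._field_tag = None

    def handle_data(self, data):
        if self._skip_depth or self._field is None:
            return
        if self._field == "body":
            self.paragraphs[-1] += data
        else:
            setattr(self, self._field, getattr(self, self._field) + data.strip())


def extract_article(html):
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    body = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    return {"title": parser.title, "date": parser.date, "author": parser.author,
            "image_url": parser.image_url, "body": body}


SAMPLE = """
<html><body>
<div class="headline-list"><p>Other headline</p></div>
<h1>Example title</h1>
<span class="date">2017-10-01</span>
<span class="author">A. Reporter</span>
<img src="http://example.com/photo.jpg">
<p>First paragraph.</p>
<div class="ads"><p>Buy now!</p></div>
<p>Second paragraph.</p>
<div class="comments"><p>Nice article</p></div>
</body></html>
"""

article = extract_article(SAMPLE)
print(article)
```

Only the metadata and the two article paragraphs survive; text inside the ad, headline-list, and comment blocks is dropped.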
Twitter API
API: search/tweets
Example: Free Twitter Search
History: 7 days
Start search
100% results
API: Realtime (Sample)
Random Sample / All Statuses
Less than 10%
API: Realtime (Filter)
Filtered Statuses / All Statuses
~ 100%
POST statuses/filter
Filter: max 400 keywords
API: > 400 keywords?
[Diagram: All Statuses feed n servers (IP Addr 1, IP Addr 2, ..., IP Addr n), each running its own filter of max 400 keywords.]
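A minimal sketch of the idea on this slide: when the keyword list exceeds the 400-keyword limit of one filter stream, split it into chunks and give each chunk to a separate server. The function names and the server addresses are hypothetical, and the sketch only plans the assignment; it does not open any stream.

```python
def split_keywords(keywords, max_per_stream=400):
    """Deduplicate keywords and cut them into chunks of at most max_per_stream."""
    unique = list(dict.fromkeys(k.strip().lower() for k in keywords if k.strip()))
    return [unique[i:i + max_per_stream]
            for i in range(0, len(unique), max_per_stream)]


def assign_to_servers(keywords, servers, max_per_stream=400):
    """Map each server (one IP address each) to its own keyword chunk."""
    chunks = split_keywords(keywords, max_per_stream)
    if len(chunks) > len(servers):
        raise ValueError(
            f"need {len(chunks)} servers for {len(keywords)} keywords, "
            f"only {len(servers)} available")
    return dict(zip(servers, chunks))


# Hypothetical example: 1000 keywords spread over three servers.
keywords = [f"kw{i}" for i in range(1000)]
servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
plan = assign_to_servers(keywords, servers)
for ip, chunk in plan.items():
    print(ip, len(chunk))
```

Each server would then run one filter stream with its own chunk, and the results are merged downstream.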
Twitter API Tools
Net::Twitter
Twitter API: Drone Emprit
Net::Twitter
AnyEvent::Twitter::Stream
Facebook API
FB API (v1): Public Search
April 2014: shut down by Facebook
FB API (v2): Searching
FB API (v2): Object
https://graph.facebook.com/$object_id/$type?
fields=id,
parent_id,
from,
to,
type,
status_type,
story,
message,
link,
likes.summary(true),
shares,
comments.order(reverse_chronological).summary(true),
created_time,
updated_time
&order=reverse_chronological
&access_token=$access_token&limit=$limit&until=$last_timestamp
$object_id = FB Page ID, etc
$type = [feed, comment, ...]
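The URL template above can be assembled programmatically. The sketch below uses only Python's standard library and mirrors the slide's field list as-is; `build_graph_url` and the sample ID, token, and timestamp are hypothetical, and current Graph API versions may no longer accept this exact request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Field list copied from the slide's template.
FIELDS = [
    "id", "parent_id", "from", "to", "type", "status_type", "story",
    "message", "link", "likes.summary(true)", "shares",
    "comments.order(reverse_chronological).summary(true)",
    "created_time", "updated_time",
]


def build_graph_url(object_id, edge, access_token, limit=100, until=None):
    """Build the request URL for one page of results.

    object_id: e.g. a FB Page ID; edge: e.g. "feed".
    until: timestamp of the oldest item already fetched, for paging backwards.
    """
    params = {
        "fields": ",".join(FIELDS),
        "order": "reverse_chronological",
        "access_token": access_token,
        "limit": limit,
    }
    if until is not None:
        params["until"] = until
    return f"https://graph.facebook.com/{object_id}/{edge}?{urlencode(params)}"


# Hypothetical values, for illustration only.
url = build_graph_url("123456789", "feed", "TOKEN", limit=25, until=1506816000)
query = parse_qs(urlsplit(url).query)
print(url)
```

Passing the timestamp of the last item seen as `until` lets a crawler walk backwards through a page's feed one batch at a time.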
FB API Tools
Facebook::Graph
fb 0.4.0
FB API: Drone Emprit
Built in-house, powered by: WWW::Curl
Question: Perl or Python?
Of course!
Why Perl?
Perl is what helped humankind after the fall to earth, and is of course more 'nyunah' (in keeping with the sunnah).
Python is what tempted Adam and Eve, who then came down from paradise.
Search Engine/Indexing
Full Text Indexing
Data Sources, Search Engine
Full Text Search Engines
Search Engine: Drone Emprit
Simple - Powerful - Robust - Scalable
Solr Server Configuration
Sharding
Replication
Analytics
Analytics: Server Configuration
Slave, Analysis Results, Analysis Processes
Analytics Engine
Search by Keywords
News, Tweets, Statuses, etc.
Sentiment Analysis
Opinion Analysis
Term Extraction
Segmentation
Quote Extraction
Named Entity Recognition
Search Results
Paragraph Segmentation
NEWS ARTICLES, MENTIONS
Sentiment Analysis
MENTIONS: Positive, Negative, or Neutral?
For Setya Novanto: Positive?
For KPK: Negative?
For Judge Cepi Iskandar: Neutral?
Sentiment Analysis Techniques
http://www.sciencedirect.com/science/article/pii/S2090447914000550
Evaluation
http://www.sciencedirect.com/science/article/pii/S2090447914000550
A "one model for all" approach cannot give the correct label for every subject.
A lexicon-based approach depends on whether the words appear in the sentiment dictionary, and cannot give the correct label for different subjects.
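The limitation described above can be seen in a toy example. The lexicon, the headline, and the function names below are all made up for illustration. A lexicon-based scorer looks only at the words in the text, so it returns one label no matter which subject is asked about.

```python
import re

# Toy sentiment lexicon (hypothetical): word -> polarity.
LEXICON = {"win": 1, "wins": 1, "good": 1, "lose": -1, "loses": -1, "bad": -1}


def lexicon_label(text):
    """Sum the polarity of known words; unknown words contribute nothing."""
    score = sum(LEXICON.get(tok, 0) for tok in re.findall(r"\w+", text.lower()))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def label_per_subject(text, subjects):
    """A lexicon has no notion of subject, so every subject gets the same label."""
    return {subject: lexicon_label(text) for subject in subjects}


# Hypothetical headline, written for this example.
headline = "Judge Cepi Iskandar grants Setya Novanto's pretrial motion, KPK loses"
subjects = ["Setya Novanto", "KPK", "Judge Cepi Iskandar"]
labels = label_per_subject(headline, subjects)
print(labels)
```

All three subjects receive the label "negative", although a human reader would probably label the same text differently for each of them.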
Sentiment Analysis Tools
https://breakthroughanalysis.com/2012/01/08/what-are-the-most-powerful-open-source-sentiment-analysis-tools/
Text Mining Module
Sentiment Analysis: Drone Emprit
Adaptive Multiple Models
Training Data
DOI: 10.1109/ICMLA.2015.22
81,000
Opinion Analysis
Kapolri (National Police Chief): Opinion Analysis
With the Polri Public Relations Division (DivHumas) on Kompas Petang
MK Opinion Analysis Feature
Analysis of the Statistics
Reading the Voice, not the Noise
Analysis Affected by Noise
Unfortunately, it was this noise-based analysis that went viral.
Opinion Analysis Techniques
Drone Emprit
Regular Expression
Opinion Analysis
Quote Extraction
QUOTE, QUOTE HOLDER
Quote Extraction: Drone Emprit
Pattern matching with regular expressions
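A minimal sketch of quote extraction by pattern matching, assuming the common news pattern of a quoted sentence followed by a reporting verb and a capitalized name. The pattern, the verb list, and the sample text (with invented speakers) are illustrative assumptions, not the rules Drone Emprit actually uses.

```python
import re

# "<quote>," kata|ujar|said <Capitalized Name>
QUOTE_PATTERN = re.compile(
    r'"(?P<quote>[^"]+)"\s*,?\s*(?:kata|ujar|said)\s+'
    r'(?P<holder>[A-Z]\w*(?:\s+[A-Z]\w*)*)'
)


def extract_quotes(text):
    """Return (quote, quote holder) pairs found in the text."""
    return [
        (m.group("quote").rstrip(",. "), m.group("holder"))
        for m in QUOTE_PATTERN.finditer(text)
    ]


# Invented sample text with hypothetical speakers.
text = (
    'The police responded. "We will act against hoax spreaders," '
    'said Budi Santoso in Jakarta. "Data is the new gold," ujar Siti Rahma.'
)
quotes = extract_quotes(text)
for quote, holder in quotes:
    print(f"{holder}: {quote}")
```

A production rule set would need many more patterns, for example a holder placed before the quote, or quotes split across sentences.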
Named Entity Recognition
LOCATION, PERSON, ORGANIZATION
NER Tools
NER: Drone Emprit
NER Example
Clustering
Clustering Types
Clustering Tools
http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm
Topic Map: Document Clustering
Social Network Analysis
SNA: Social Network Analysis
• SNA maps the relations among people, organizations, topics, locations, and other information entities.
• A node, or point, in the network represents a person, organization, location, or information entity.
• A line connecting two nodes represents the relation between them.
Betweenness Centrality
Betweenness Centrality: a measure of centrality based on how many shortest paths between other nodes pass through a node.
Highest betweenness centrality
(8 connections)
Lowest betweenness centrality
(4 connections)
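As a hedged illustration, betweenness centrality can be computed with Brandes' algorithm. The sketch below handles undirected, unweighted graphs, and the small graph is made up. Note that the score counts shortest paths passing through a node, which is not the same as the node's number of connections.

```python
from collections import deque


def betweenness_centrality(graph):
    """Brandes' algorithm. graph: dict mapping node -> list of neighbours."""
    centrality = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, farthest nodes first.
        delta = {v: 0.0 for v in graph}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                centrality[w] += delta[w]
    # In an undirected graph every pair was counted from both ends.
    return {v: c / 2 for v, c in centrality.items()}


# Made-up graph: two triangles that share the node "bridge".
graph = {
    "A1": ["A2", "bridge"],
    "A2": ["A1", "bridge"],
    "bridge": ["A1", "A2", "B1", "B2"],
    "B1": ["bridge", "B2"],
    "B2": ["bridge", "B1"],
}
scores = betweenness_centrality(graph)
print(scores)
```

Here "bridge" scores 4.0 because all four shortest paths between the A side and the B side pass through it, while every other node scores 0.0.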
Anatomy of a Tweet
Retweet Relation
Link Functions: Retweet / Mention
Retweet Network
Mention Network
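The retweet and mention networks can be sketched as weighted edge lists built from tweets. This example parses the tweet text with regular expressions and uses made-up accounts; real Twitter API payloads also carry structured fields such as `retweeted_status` and `entities.user_mentions`, which are more reliable than text parsing.

```python
import re
from collections import Counter

RT_PATTERN = re.compile(r"^RT @(\w+)")
MENTION_PATTERN = re.compile(r"@(\w+)")


def build_edges(tweets):
    """tweets: iterable of (author, text). Returns (retweet, mention) edge counts."""
    retweets, mentions = Counter(), Counter()
    for author, text in tweets:
        rt = RT_PATTERN.match(text)
        if rt:
            # A retweet links the retweeting account to the original author.
            retweets[(author, rt.group(1))] += 1
            continue
        for target in MENTION_PATTERN.findall(text):
            if target != author:
                mentions[(author, target)] += 1
    return retweets, mentions


# Made-up tweets, for illustration only.
tweets = [
    ("alice", "RT @bob: Data is the new gold"),
    ("carol", "RT @bob: Data is the new gold"),
    ("dave", "@bob @alice do you have a source?"),
]
retweet_edges, mention_edges = build_edges(tweets)
print(dict(retweet_edges))
print(dict(mention_edges))
```

Each `(source, target)` pair becomes a directed edge, and the count can serve as the edge weight when the network is drawn.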
Information Arbitrage
Information arbitrage: translate information across groups
Visualization
User Dashboard
Analysis Results, Slave
Visualization Tools
D3js.org
Drone Emprit is Hiring
System Administrator & Programmer
Thank you
Ismail Fahmi, PhD
Drone Emprit
PT Media Kernels Indonesia
Email: ismail.fahmi@gmail.com
Mobile: 0812 8908 3894
