Tokyotextmining#1 kaneyama genta

  @PENGUINANA_ (genta kaneyama)

http://pcod.no-ip.org/

 
visualization

  http://b.hatena.ne.jp/pcod/nlp/

  twitter (yats)

  2009 Web

 

 

  1

 

 
 
 



  900 tweet/

  60

  3
AND
>400

  …



  Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections (WWW2008)

  Wikipedia, MEDLINE LDA

  Wikipedia "universal corpus"



…

http://www.baidu.jp/unlp/#omake

API http://pcod.no-ip.org/yats/genre

http://pcod.no-ip.org/yats/

 
 

 
 

 

 

 

  API

 



Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections (WWW2008)

?
  Unlike normal documents, these text & Web segments are usually noisier, less topic-focused, and much shorter, that is, they consist of from a dozen words to a few sentences. Because of the short length, they do not provide enough word co-occurrence or shared context for a good similarity measure.



  (Wikipedia )

  LDA

  (Wikipedia )

  ME (maximum entropy)

 

  LDA



LDA (model) -> MaxEnt (classifier)
  LDA sparse text

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. JMLR, 3:993–1022, 2003.
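The LDA (model) -> MaxEnt (classifier) flow can be illustrated with a small sketch of the feature step: topics inferred for a short text are appended to its words as pseudo-words before the text goes to the classifier. The function name, the `cutoff` and `scale` parameters, and the `topic:N` token format are assumptions made for this illustration, not details taken from the slides or the paper.

```python
def augment_with_topics(words, topic_probs, cutoff=0.05, scale=10):
    """Return the original words plus pseudo-words for salient topics.

    words       : tokens of one short text
    topic_probs : per-topic probabilities inferred for that text (e.g. by LDA)
    cutoff      : hypothetical threshold; topics below it are ignored
    scale       : hypothetical factor turning a probability into a repeat count
    """
    features = list(words)
    for topic_id, prob in enumerate(topic_probs):
        if prob >= cutoff:
            # Repeat the topic token in proportion to its probability so a
            # bag-of-words classifier sees dominant topics more often.
            repeats = max(1, round(prob * scale))
            features.extend(["topic:%d" % topic_id] * repeats)
    return features


if __name__ == "__main__":
    print(augment_with_topics(["lda", "tweet"], [0.7, 0.02, 0.28]))
```

The augmented token list can then be fed to any bag-of-words classifier; the slides name maximum entropy for that role.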

  SVM ME

  SVM

  SVM ( )

  Wikipedia

 

 

  Hidden topic

 



 

  Universal corpus

 

 
MySQL + Python + TokyoCabinet

 

  MeCab(+ ) + Python + MPICH2

  API

  Python + Django (or tornado) + apache (or nginx) + redis

 

  , ,URL

  if (' ' in n.feature and not ' , ' in n.feature and not ' ' in n.feature)
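The condition above filters morphemes by their feature string. A minimal sketch of that kind of filter follows. The `Node` tuple stands in for a MeCab node, and the tag values ("名詞" for noun, "非自立" for dependent, "数" for number) are hypothetical choices for illustration, since the tag strings used on the slide are not visible here.

```python
from collections import namedtuple

# Stand-in for a MeCab node: surface form plus comma-separated feature string.
Node = namedtuple("Node", ["surface", "feature"])


def keep_node(node, required, excluded):
    """Keep a node whose feature contains `required` and none of `excluded`."""
    if required not in node.feature:
        return False
    return not any(tag in node.feature for tag in excluded)


# Hand-written sample nodes in IPADIC-style feature format (illustrative only).
nodes = [
    Node("東京", "名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー"),
    Node("こと", "名詞,非自立,一般,*,*,*,こと,コト,コト"),
    Node("3", "名詞,数,*,*,*,*,*"),
    Node("は", "助詞,係助詞,*,*,*,*,は,ハ,ワ"),
]

# Hypothetical rule: keep nouns, drop dependent nouns and numerals.
kept = [n.surface for n in nodes if keep_node(n, "名詞", ("非自立", "数"))]

if __name__ == "__main__":
    print(kept)
```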

  (Python+TokyoCabinet)

  cat dump.txt | python loadtoTC.py

  >df.addint(key,1)
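The `df.addint(key,1)` call suggests a document-frequency count kept in TokyoCabinet. A minimal sketch of the same counting is shown below, with an in-memory `Counter` standing in for the TokyoCabinet database; the function name and the assumption that each input line is one whitespace-tokenized document are illustrative, not taken from `loadtoTC.py`.

```python
from collections import Counter


def document_frequency(lines):
    """Count, for each word, the number of documents (lines) containing it."""
    df = Counter()
    for line in lines:
        # set() ensures a word is counted once per document, not per occurrence.
        for word in set(line.split()):
            df[word] += 1  # with TokyoCabinet this would be df.addint(word, 1)
    return df


if __name__ == "__main__":
    print(document_frequency(["a b a", "b c"]))
```

Reading from `sys.stdin` instead of a list would match the `cat dump.txt | python ...` usage shown above.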

  ( )

  100

 



  plda …

  http://code.google.com/p/plda/

 

  N=50,alpha=0.5,beta=0.1

  alpha: *empirically* MIN(1,50/num_topics)

  beta: *empirically* 0.1

Master> mpd --daemon --listenport=55555

Slave> mpd --daemon -h master -p 55555

mpdtrace; mpdringtest

mpiexec -n 8 ./mpi_lda --num_topics 50 --total_iterations 150 --alpha 1 --beta 0.1 --training_data_file ~/201005.txt --model_file /tmp/lda_model201005.txt
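As a sketch of how the training data file could be prepared: plda, as far as its documentation describes, expects one document per line written as alternating word and count pairs. The helper names below are hypothetical, and the alpha helper simply encodes the MIN(1,50/num_topics) rule of thumb quoted earlier; verify the file format against the plda README before relying on it.

```python
from collections import Counter


def suggested_alpha(num_topics):
    """Rule of thumb from the slide: MIN(1, 50 / num_topics)."""
    return min(1.0, 50.0 / num_topics)


def to_plda_line(tokens):
    """Format one tokenized document as 'word count word count ...'.

    Words are sorted only to make the output deterministic.
    """
    counts = Counter(tokens)
    return " ".join("%s %d" % (word, n) for word, n in sorted(counts.items()))


if __name__ == "__main__":
    print(suggested_alpha(50))
    print(to_plda_line(["lda", "topic", "lda"]))
```

With 50 topics the rule gives alpha = 1, which matches the `--alpha 1` used in the command above.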

[Topic word list: top words with counts from 9249.0 down to 821.0; the word column did not survive]

label,4,

[Topic word list: top words with counts from 20262.0 down to 637.0; the word column did not survive]

label,1,

[Topic word list: top words with counts from 56227.0 down to 608.0; the word column did not survive]

label,2,???

[Topic word list: top words with counts from 8448.0 down to 806.0; the word column did not survive]

label,6,

[Topic word list: top words with counts from 31845.0 down to 681.0; the word column did not survive]

label,9,
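Per-topic word lists like the ones above can be produced from the trained model file. The sketch below assumes each model line holds a word followed by one count per topic, which is how plda's model output is understood to be laid out; the function name and the sample lines are made up for illustration.

```python
def top_words(model_lines, topic, n=3):
    """Return the n (word, count) pairs with the highest count for `topic`."""
    rows = []
    for line in model_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        word = fields[0]
        counts = [float(x) for x in fields[1:]]
        rows.append((counts[topic], word))
    rows.sort(reverse=True)
    return [(word, count) for count, word in rows[:n]]


if __name__ == "__main__":
    # Made-up sample: a two-topic model with four words.
    sample = [
        "soccer\t120.0 3.0",
        "goal\t80.0 1.0",
        "election\t2.0 95.0",
        "vote\t0.0 60.0",
    ]
    print(top_words(sample, 0, n=2))
    print(top_words(sample, 1, n=2))
```

Labels such as "label,4," would then be assigned by reading the top words of each topic.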

 

 

 

 

 



  Latent topic

  Latent topic

 

 

 

 

 



API Get

1.  
2. (redis) 
3.

[Topic1:prob,Topic2:prob,]

JSON
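One possible reading of the API flow (tokenize, check a cache such as redis, estimate topics, return JSON) is sketched below. Everything here is an assumption for illustration: the function names, the JSON shape, the plain dict standing in for redis, whitespace splitting standing in for MeCab, and above all the topic estimate, which is a crude normalized word-count average rather than real LDA inference.

```python
import json


def topic_distribution(tokens, word_topic_counts, num_topics, beta=0.1):
    """Crude topic estimate: average of smoothed per-word topic proportions."""
    totals = [0.0] * num_topics
    for tok in tokens:
        counts = word_topic_counts.get(tok)
        if counts is None:
            continue  # unknown words contribute nothing
        norm = sum(counts) + beta * num_topics
        for k in range(num_topics):
            totals[k] += (counts[k] + beta) / norm
    total = sum(totals)
    if total == 0:
        return [1.0 / num_topics] * num_topics  # no known words: uniform
    return [t / total for t in totals]


def handle_get(text, word_topic_counts, num_topics, cache):
    """Return a JSON string of topic probabilities, using `cache` if possible."""
    if text in cache:
        return cache[text]
    tokens = text.split()  # placeholder for morphological analysis
    probs = topic_distribution(tokens, word_topic_counts, num_topics)
    body = json.dumps(
        [{"topic": k, "prob": round(p, 4)} for k, p in enumerate(probs)]
    )
    cache[text] = body
    return body


if __name__ == "__main__":
    counts = {"goal": [9.0, 1.0], "vote": [0.0, 10.0]}
    print(handle_get("goal goal", counts, 2, {}))
```

In a deployment the dict would be replaced by a redis client and the response would be served by the web framework in use.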

  PLDA

Collaborative Filtering for Orkut Communities: Discovery of User Latent Behavior. Wen-Yen Chen et al., WWW 2009.

http://www.cs.ucsb.edu/~wychen/publications/fp365-chen.pdf

  …

The role of semantic history on online generative topic modeling. L. AlSumait, D. Barbará, C. Domeniconi.

http://www.ise.gmu.edu/~carlotta/publications/Siam_SemOLDA.pdf

  LDA …

R LDA author facebook/data
LDA

  Not-So-Latent Dirichlet Allocation: Collapsed Gibbs Sampling Using Human Judgments

  ePluribus: Ethnicity on Social Networks.

http://www.facebook.com/data#!/data?v=app_4949752878

http://www.baidu.jp/unlp/#omake

Q&A
