SlideShare a Scribd company logo
TOPIC MODELING USING
BIG DATA ANALYTICS
-BY
SARAH MASUD(12-CSS-57)
FARHEEN NILOFER(12-CSS-23)
INTRODUCTION
WORK FLOW:
Installation of Hadoop on multiple nodes- For distributed processing.
Pre processing of data set - Cleaning and conversion of data into desired
format.
Passing the converted data to the modeling tool.
Parallelizing the computation and algorithm selection.
Comparison of results on the basis of
Efficiency of different modelling algorithms
Computation on single vs multiple node.
WHAT Is Big Data and Topic Modelling?
Big Data:
Data that cannot be stored or processed by traditional computing techniques.
EXAMPLES: Black Box Data, Social Media, Space Exploration, Power and Grid Station,Search Engine Data…
Topic Modelling:
Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections.
IN LAYMAN TERMS
A method of text mining to identify patterns in a corpus. Topic modeling helps us develop new ways to search,
browse and summarize large archives of texts.
VOLUME VARIETY VELOCITY
TOPIC MODELING IN IMPLEMENTATION
WHY Topic Modelling using Big Data ?
You have to create several different variables for every single word in the corpus. The models we
would be running, with roughly 2,000 documents, will get to the edge of what can be done on an
average desktop machine, and commonly take a day.
Hadoop is a framework which could provide all the facilities that are needed in modelling of such a
huge set of data.
till
2
DAYS
10 MINS
??
5 MILLION GIGABYTES OF DATA GENERATED(amount)
2003
2011
2013
2015
HADOOP and its COMPONENTS
Hadoop:
An open source framework written in JAVA.
It is designed to scale up from single servers to
thousands of machines, each offering local
computation and storage.(confi)
It has two major component-
HDFS(Hadoop Distributed File System)- For the
storage.
MapReduce- Processing of data.(pgram model)
Hadoop Installation:
Cluster of 5 (in our case) commodity hardwares.
Namenode-the manager.
Datanodes- the actual storage and processing units.
COMPARISON OF EXECUTION TIME
HOW Topic Modelling is Achieved Using
Big Data Analytics?
Proposed Algorithms:
Probabilistic Latent Semantic Indexing ( PLSI) :
It is a novel statistical technique for the analysis of two-mode and co-occurrence data
Latent Dirichlet allocation (LDA):
It’s a way of automatically discovering topics that sentences contain.
Pachinko allocation
Modeling correlations between topics in addition to the word correlations which constitute topics.
`
d wz
HOW Topic Modelling is Achieved
Using Big Data Analytics?
TOOLS Model/Algorithm Language Introduction
Mallet LDA(including Naïve
Bayes, Maximum
Entropy, and Decision
Trees)
Java efficient routines for
converting text to
"features", a wide
variety of algorithms
lda R package for Gibbs
sampling in many models
R Implements many models
and is fast
hdp Hierarchical Dirichlet
processes
C++ Topic models where the
data determine the
number of topics. This
implements Gibbs
sampling.
TOPIC MODELLING TOOLS
WHERE is Topic Modeling Using Big
Data Applied?
SOME APPLICATIONS OF TOPIC MODELING INCLUDE:
● Topic Modeling for analyzing news articles.
● Topic Modeling for Page Rank in Search Engines.
● Finding patterns in genetic data, images, social graphs.
● Topic modeling on historical journals.
REFERENCES:
1. Papadimitriou, Christos; Raghavan, Prabhakar; Tamaki, Hisao; Vempala, Santosh (1998). "Latent Semantic
Indexing: A probabilistic analysis" (Postscript). Proceedings of ACM PODS.
2. Blei, David M.; Ng, Andrew Y.; Jordan, Michael I; Lafferty, John (January 2003). "Latent Dirichlet allocation".
Journal of Machine Learning Research 3: 993–1022. doi:10.1162/jmlr.2003.3.4-5.993.
3. Blei, David M. (April 2012). "Introduction to Probabilistic Topic Models" (PDF). Comm. ACM 55 (4): 77–84.
doi:10.1145/2133806.2133826.
4. Sanjeev Arora; Rong Ge; Ankur Moitra (April 2012). "Learning Topic Models—Going beyond SVD".
arXiv:1204.1956.
THANK YOU

More Related Content

Similar to Topic modeling using big data analytics

Topic modeling using big data analytics
Topic modeling using big data analyticsTopic modeling using big data analytics
Topic modeling using big data analytics
Farheen Nilofer
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
Tyrone Systems
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
Vijay Srinivas Agneeswaran, Ph.D
 
MAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine LearningMAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine Learning
Gianvito Siciliano
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
IJCSIS Research Publications
 
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
ijiert bestjournal
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo db
Alexander Decker
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo db
Alexander Decker
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
MLconf
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
Josh Patterson
 
PandoraPPT2
PandoraPPT2PandoraPPT2
PandoraPPT2
Ace Haidrey
 
BIG Data Science: A Path Forward
BIG Data Science:  A Path ForwardBIG Data Science:  A Path Forward
BIG Data Science: A Path Forward
Dan Mallinger
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanyc
Open Analytics
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
Editor IJCATR
 
Big data analytics_7_giants_public_24_sep_2013
Big data analytics_7_giants_public_24_sep_2013Big data analytics_7_giants_public_24_sep_2013
Big data analytics_7_giants_public_24_sep_2013
Vijay Srinivas Agneeswaran, Ph.D
 
Big Data Analytics 2014
Big Data Analytics 2014Big Data Analytics 2014
Big Data Analytics 2014
Stratebi
 
Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013
Vijay Srinivas Agneeswaran, Ph.D
 
BDAModule-1.pptx
BDAModule-1.pptxBDAModule-1.pptx
BDAModule-1.pptx
bharathmadival8055
 
Big data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBig data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edge
Bhavya Gulati
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
NPN Training
 

Similar to Topic modeling using big data analytics (20)

Topic modeling using big data analytics
Topic modeling using big data analyticsTopic modeling using big data analytics
Topic modeling using big data analytics
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
MAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine LearningMAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine Learning
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
 
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo db
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo db
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 
PandoraPPT2
PandoraPPT2PandoraPPT2
PandoraPPT2
 
BIG Data Science: A Path Forward
BIG Data Science:  A Path ForwardBIG Data Science:  A Path Forward
BIG Data Science: A Path Forward
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanyc
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
 
Big data analytics_7_giants_public_24_sep_2013
Big data analytics_7_giants_public_24_sep_2013Big data analytics_7_giants_public_24_sep_2013
Big data analytics_7_giants_public_24_sep_2013
 
Big Data Analytics 2014
Big Data Analytics 2014Big Data Analytics 2014
Big Data Analytics 2014
 
Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013
 
BDAModule-1.pptx
BDAModule-1.pptxBDAModule-1.pptx
BDAModule-1.pptx
 
Big data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBig data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edge
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
 

Recently uploaded

End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
VyNguyen709676
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
wyddcwye1
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
xclpvhuk
 
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
a9qfiubqu
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
taqyea
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
sameer shah
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 

Recently uploaded (20)

End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
 
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 

Topic modeling using big data analytics

  • 1. TOPIC MODELING USING BIG DATA ANALYTICS -BY SARAH MASUD(12-CSS-57) FARHEEN NILOFER(12-CSS-23)
  • 2. INTRODUCTION WORK FLOW: Installation of Hadoop on multiple nodes- For distributed processing. Pre processing of data set - Cleaning and conversion of data into desired format. Passing the converted data to the modeling tool. Parallelizing the computation and algorithm selection. Comparison of results on the basis of Efficiency of different modelling algorithms Computation on single vs multiple node.
  • 3. WHAT Is Big Data and Topic Modelling? Big Data: Data that cannot be stored or processed by traditional computing techniques. EXAMPLES: Black Box Data, Social Media, Space Exploration, Power and Grid Station,Search Engine Data… Topic Modelling: Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections. IN LAYMAN TERMS A method of text mining to identify patterns in a corpus. Topic modeling helps us develop new ways to search, browse and summarize large archives of texts. VOLUME VARIETY VELOCITY
  • 4. TOPIC MODELING IN IMPLEMENTATION
  • 5. WHY Topic Modelling using Big Data ? You have to create several different variables for every single word in the corpus. The models we would be running, with roughly 2,000 documents, will get to the edge of what can be done on an average desktop machine, and commonly take a day. Hadoop is a framework which could provide all the facilities that are needed in modelling of such a huge set of data. till 2 DAYS 10 MINS ?? 5 MILLION GIGABYTES OF DATA GENERATED(amount) 2003 2011 2013 2015
  • 6. HADOOP and its COMPONENTS Hadoop: An open source framework written in JAVA. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.(confi) It has two major component- HDFS(Hadoop Distributed File System)- For the storage. MapReduce- Processing of data.(pgram model) Hadoop Installation: Cluster of 5 (in our case) commodity hardwares. Namenode-the manager. Datanodes- the actual storage and processing units.
  • 8. HOW Topic Modelling is Achieved Using Big Data Analytics? Proposed Algorithms: Probabilistic Latent Semantic Indexing ( PLSI) : It is a novel statistical technique for the analysis of two-mode and co-occurrence data Latent Dirichlet allocation (LDA): It’s a way of automatically discovering topics that sentences contain. Pachinko allocation Modeling correlations between topics in addition to the word correlations which constitute topics. ` d wz
  • 9. HOW Topic Modelling is Achieved Using Big Data Analytics? TOOLS Model/Algorithm Language Introduction Mallet LDA(including Naïve Bayes, Maximum Entropy, and Decision Trees) Java efficient routines for converting text to "features", a wide variety of algorithms lda R package for Gibbs sampling in many models R Implements many models and is fast hdp Hierarchical Dirichlet processes C++ Topic models where the data determine the number of topics. This implements Gibbs sampling. TOPIC MODELLING TOOLS
  • 10. WHERE is Topic Modeling Using Big Data Applied? SOME APPLICATIONS OF TOPIC MODELING INCLUDE: ● Topic Modeling for analyzing news articles. ● Topic Modeling for Page Rank in Search Engines. ● Finding patterns in genetic data, images, social graphs. ● Topic modeling on historical journals.
  • 11. REFERENCES: 1. Papadimitriou, Christos; Raghavan, Prabhakar; Tamaki, Hisao; Vempala, Santosh (1998). "Latent Semantic Indexing: A probabilistic analysis" (Postscript). Proceedings of ACM PODS. 2. Blei, David M.; Ng, Andrew Y.; Jordan, Michael I; Lafferty, John (January 2003). "Latent Dirichlet allocation". Journal of Machine Learning Research 3: 993–1022. doi:10.1162/jmlr.2003.3.4-5.993. 3. Blei, David M. (April 2012). "Introduction to Probabilistic Topic Models" (PDF). Comm. ACM 55 (4): 77–84. doi:10.1145/2133806.2133826. 4. Sanjeev Arora; Rong Ge; Ankur Moitra (April 2012). "Learning Topic Models—Going beyond SVD". arXiv:1204.1956.