Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Smart Searching Through
Trillion Research Papers
with Apache Spark ML
Himanshu Gupta, Knoldus Inc.
#SAISEco3
About Me
❑ Lead Consultant (Engineering) at Knoldus Inc.
❑ Work on reactive and streaming fast data solutions by leveragin...
Agenda
The Need
Challenges
Our Solution
Future work
#SAISEco3
S
The Need: Make Better Decisions Faster
How much does
it cost to get a
Car from Concept
phase to
Sales floor ?
Typically ...
S
The Need: Make Better Decisions Faster (contd.)
How long does it
take to get a
new drug to
market?
It takes
10-12 years ...
❑ In June, 2018, Tata motors produced just one unit of Nano (world’s cheapest car).
❑ In case of few diseases the success ...
Best Solution:
Leverage the Work Done
Pharma companies partner with Research
Organizations and Academic Institutes to
redu...
The Challenge:
It is Difficult
#SAISEco3
❑ R&D data is extremely complex.
❑ Each and every research work have a specific A...
❑ Where all the work done (research papers/articles) are collated.
❑ Allow easy access to the relevant research work.
❑ Di...
Design Philosophy
#SAISEco3
❑ Extracting content from Research papers/articles is a
time consuming and tiring process.
❑ Requires expertise of SME(s)....
Read
Documents
Read Research Papers
/Articles from S3 / HDFS
Index
Index the content
in to:
• Title
• Structure
• Special ...
Step 1: Extract Content (Output)
#SAISEco3
Word1 Word2 Word3
Doc1 Count Count Count
Doc2 Count Count Count
Doc3 Count Coun...
Step 2: Analyze Content (First Iteration)
#SAISEco3
It takes in a collection of documents
as vectors of word counts along ...
Step 2: Analyze Content (LDA Output)
#SAISEco3
● Above words with term weights may not necessarily be the final chosen phr...
Step 2: Analyze Content (Identify Clusters)
#SAISEco3
#SAISEco3
Step 2: Analyze Content (Output)
Cluster of words formed from the research papers on Tuberculosis
Step 3: Store Facts (Indexing Documents)
#SAISEco3
Step 3: Store Facts (Output)
#SAISEco3
Now we can search documents on the basis of terms we want to:
select * from facts w...
Semantic Search
❑ Index Data in Elasticsearch/Solr
❑ Run semantic query over indexed data
❑ Like, How Can we Separate Gold...
+(1) 647-467-4396
https://www.facebook.com/KnoldusS
oftware/
@himanshug735
Thank You!
Stay in Touch
Upcoming SlideShare
Loading in …5
×

of

Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 1 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 2 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 3 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 4 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 5 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 6 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 7 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 8 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 9 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 10 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 11 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 12 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 13 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 14 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 15 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 16 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 17 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 18 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 19 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 20 Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta Slide 21
Upcoming SlideShare
What to Upload to SlideShare
Next
Download to read offline and view in fullscreen.

2 Likes

Share

Download to read offline

Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta

Download to read offline

Every publication has a rich set of documents that contain information about different domains. Mostly, these documents keeps on sitting in data warehouses. If used wisely, they can prove to be a golden set for companies operating in domains like pharma, medical, or financial institutions.

For example, today it takes any pharmaceutical company upto 12 years and $2 billion to bring a single new drug to market. Despite the huge spend, scientists in Pharma don’t have a way to find the data on the work which is already done. They just redo the whole thing, wasting money on duplicate work.

The biggest challenge in making those documents searchable is that they need to be tagged with their corresponding topics for which SMEs [Subject Matter Experts] are required. SMEs would read the document and fetch the topics, tag it with the topics. This way of tagging documents is slow and expensive.

This talk explains how we can apply Spark ML to tag 100s of thousands of documents. Applying ML will not only make tagging process faster & less expensive but also can explore new fields which are overlooked by SMEs.

Smart Searching Through Trillion of Research Papers with Apache Spark ML with Himanshu Gupta

  1. 1. Smart Searching Through Trillion Research Papers with Apache Spark ML Himanshu Gupta, Knoldus Inc. #SAISEco3
  2. 2. About Me ❑ Lead Consultant (Engineering) at Knoldus Inc. ❑ Work on reactive and streaming fast data solutions by leveraging Scala/Spark ecosystem. #SAISEco3
  3. 3. Agenda The Need Challenges Our Solution Future work #SAISEco3
  4. 4. S The Need: Make Better Decisions Faster How much does it cost to get a Car from Concept phase to Sales floor ? Typically it takes 2-5 years and $1 Billion to do that. Journalist Auto Industry Expert #SAISEco3
  5. 5. S The Need: Make Better Decisions Faster (contd.) How long does it take to get a new drug to market? It takes 10-12 years and $2.5 Billion to do that. Journalist Pharmaceutical Scientist #SAISEco3
  6. 6. ❑ In June, 2018, Tata motors produced just one unit of Nano (world’s cheapest car). ❑ In case of few diseases the success rate of new drug being approved is less than 20%. Surprises can be Costly #SAISEco3
  7. 7. Best Solution: Leverage the Work Done Pharma companies partner with Research Organizations and Academic Institutes to reduce R&D cost up to 30%. . Cars in India uses common engines. 60%
  8. 8. The Challenge: It is Difficult #SAISEco3 ❑ R&D data is extremely complex. ❑ Each and every research work have a specific Aim which can overlap with other research work or not. ❑ The test environment of R&D work is different than actual world. ❑ There are many factors which are either assumed or ignored while conducting research. ❑ Facts are scattered over multiple research work.
  9. 9. ❑ Where all the work done (research papers/articles) are collated. ❑ Allow easy access to the relevant research work. ❑ Discover new fields and concepts. Our Solution: Build a Platform #SAISEco3
  10. 10. Design Philosophy #SAISEco3
  11. 11. ❑ Extracting content from Research papers/articles is a time consuming and tiring process. ❑ Requires expertise of SME(s). ❑ However, if done by systems, can become blazingly fast and cost efficient. ❑ Systems extract content from research papers/articles and store them into a database from where it can be explored Step 1: Extract Content #SAISEco3
  12. 12. Read Documents Read Research Papers /Articles from S3 / HDFS Index Index the content in to: • Title • Structure • Special Objects Enrich • Prepare N-Grams • Create MxN matrix • Index the Matrix Save Save the enriched data into Database (Cassandra) 01 02 03 04 Step 1: Extract Content (Process) To scale the extraction process we leveraged Apache Spark’s distributed computing feature #SAISEco3
  13. 13. Step 1: Extract Content (Output) #SAISEco3 Word1 Word2 Word3 Doc1 Count Count Count Doc2 Count Count Count Doc3 Count Count Count ... ... ... ... ... ... ... ... DocM Count Count Count
  14. 14. Step 2: Analyze Content (First Iteration) #SAISEco3 It takes in a collection of documents as vectors of word counts along with parameters: k, optimizer, docConcentration & topicConcentration, MaxIterations, & checkpointInterval The input was a Feature Vector of Word Counts (for each word in a bag of words) Store the results for future reference and tuning
  15. 15. Step 2: Analyze Content (LDA Output) #SAISEco3 ● Above words with term weights may not necessarily be the final chosen phrase to be identified as cluster(s). ● Because the number words that belong to cluster can be high (which is good, considering there will be several words that are ambiguous), one need to use different ways to identify phrases. Topic1 Topic2 Topic3 Word1 Term Weight Term Weight Term Weight Word2 Term Weight Term Weight Term Weight Word3 Term Weight Term Weight Term Weight ... ... ... ... ... ... ... ... WordN Term Weight Term Weight Term Weight
  16. 16. Step 2: Analyze Content (Identify Clusters) #SAISEco3
  17. 17. #SAISEco3 Step 2: Analyze Content (Output) Cluster of words formed from the research papers on Tuberculosis
  18. 18. Step 3: Store Facts (Indexing Documents) #SAISEco3
  19. 19. Step 3: Store Facts (Output) #SAISEco3 Now we can search documents on the basis of terms we want to: select * from facts where coreterms like ‘metallurgy’ Doc Id Content Cluster ID Core Terms Similarity Index Doc1 Content1 Cluster Id1 Cluster1 Terms Between (0-1) Cluster Id1:Start:Length Doc2 Content2 Cluster Id2 Cluster2 Terms Between (0-1) Cluster Id2:Start:Length ... ... ... ... ... ... DocN ContentN Cluster Id1 Cluster1 Terms Between (0-1) Cluster Id1:Start:Length
  20. 20. Semantic Search ❑ Index Data in Elasticsearch/Solr ❑ Run semantic query over indexed data ❑ Like, How Can we Separate Gold From Mercury? Or Which are the compounds which have recursive bonding with Carbon and Iron? Quality Workbench ❑ To measure the relevance of search. ❑ To tune the performance of ML algorithms. Future Work #SAISEco3
  21. 21. +(1) 647-467-4396 https://www.facebook.com/KnoldusS oftware/ @himanshug735 Thank You! Stay in Touch
  • ShelleyZhang8

    Nov. 23, 2021
  • HimanshuGupta127

    Oct. 21, 2018

Every publication has a rich set of documents that contain information about different domains. Mostly, these documents keeps on sitting in data warehouses. If used wisely, they can prove to be a golden set for companies operating in domains like pharma, medical, or financial institutions. For example, today it takes any pharmaceutical company upto 12 years and $2 billion to bring a single new drug to market. Despite the huge spend, scientists in Pharma don’t have a way to find the data on the work which is already done. They just redo the whole thing, wasting money on duplicate work. The biggest challenge in making those documents searchable is that they need to be tagged with their corresponding topics for which SMEs [Subject Matter Experts] are required. SMEs would read the document and fetch the topics, tag it with the topics. This way of tagging documents is slow and expensive. This talk explains how we can apply Spark ML to tag 100s of thousands of documents. Applying ML will not only make tagging process faster & less expensive but also can explore new fields which are overlooked by SMEs.

Views

Total views

330

On Slideshare

0

From embeds

0

Number of embeds

1

Actions

Downloads

13

Shares

0

Comments

0

Likes

2

×