Oz search

Jinsoo Kim
CEO of ETSA at Becurio
OZ Search
eBook, eJournal,
Paper, Patent, Judgment
Keyword
Search
Motivation for New Search Engine
Need to read
through the
document to find
the passage of
interest
Title in
here
Title in
here
• Identical Structure and Algorithm
• No differentiated value
• Innovative Structure and Algorithm
• Totally new search
Title in here
New-concept search service
Keyword
Searches
Delayed
Indexing
Low
efficiency
Resource
High
Cost
Resource
Back end
Pre/Post
processing
Limitations of the Keyword Search
Index DB Structure
Architecture of OZ Search
Memory based design Resources optimizing Indexing & searching speed
Index Structure Search and Index Algorithms
Shared
index DB
Multi-level
Hashing
Bucket
slots
Low Cost FnByte sharing
Bit type format algorithms Block Sorting
Memory
Optimizing
Word Pool
Hash index
Expansion
for key stroke
Typo Correction
Auto-completion
for every keyword
Ranking
for key strokeindex Inverted Data structure
+
OzKsana
Instant
OzBasic
Enter
OzDnS
Text block Instant
Search
OzMarker
Brand
OzAim
Big Data in
memory
Search Engines Applications Products
Similarity
Analyzer
Crony
Patent
+
.
.
.
.
.
.
Frequency rate
Precision rate
No Trade Off
Resource Sharing
Minimizing Duplication
Shared Index
DB
Memory based
Keyword location mngt
OzSearch
B+ tree
General Index ST
+
index & keywords
Slim
Engine
The shared index structure saves more than 50% of resources
and guarantees memory-based search for big data
Trie
Frequency rate
Precision rate
Trade Off
Index Structure of OzSearch
Memory based Index DB : Employing Multi level Hash index
* How to treat collision and sort
Bucket Blocks
b~
buckets
(prime#2)
Data…
…
…
Bucket
Blocks
buckets
(Prime#1)
a~
…
Sorted data
b~ …overflow
… …
ㄱ~ …
ㄴ~ …
… …
Multi Level Hash Index(conceptual diagram)
Sort Blocks Data Set
aa~ Data(a)
ba~ Data(b)
… …
가~ …
나~ …
… …
Sorted Data Slots
Shell Sort
Data
Hash function
Slot Data Sort
Sort Data Block creation
Data: allocating corresponding block
Shell Sort
Sort blocks: sequential merging
Hash collision handling
Data → Hash function (prime #1) → Bucket allocation
→ If Bucket(n) overflows → Hash function (prime #2)
→ Next-level bucket(n) creation
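The collision flow above can be sketched roughly as follows. This is a minimal illustration, not the OzSearch implementation: the class name, the prime values, the slot count, the string hash, and the final overflow list are all assumptions, and Python's built-in sort stands in for the Shell sort named on the slide.

```python
class MultiLevelHashIndex:
    """Illustrative multi-level hash index with fixed-size, sorted bucket slots."""

    def __init__(self, primes=(5, 7), slots_per_bucket=2):
        self.primes = primes                    # one prime per hash level (assumed values)
        self.slots = slots_per_bucket           # bucket capacity (assumed value)
        self.levels = [[[] for _ in range(p)] for p in primes]
        self.overflow = []                      # keys that fit in no level

    @staticmethod
    def _hash(key, prime):
        # Deterministic string hash; Python's hash() is salted per process.
        h = 0
        for byte in key.encode("utf-8"):
            h = h * 31 + byte
        return h % prime

    def insert(self, key):
        """Insert key and return the level it landed on."""
        for level, prime in enumerate(self.primes):
            bucket = self.levels[level][self._hash(key, prime)]
            if key in bucket:
                return level
            if len(bucket) < self.slots:
                bucket.append(key)
                bucket.sort()                   # keep slot data sorted
                return level
            # bucket is full: fall through to the next level's hash function
        if key not in self.overflow:
            self.overflow.append(key)
            self.overflow.sort()
        return len(self.primes)

    def contains(self, key):
        for level, prime in enumerate(self.primes):
            bucket = self.levels[level][self._hash(key, prime)]
            if key in bucket:
                return True
            if len(bucket) < self.slots:
                return False                    # bucket never overflowed, so key is absent
        return key in self.overflow


index = MultiLevelHashIndex()
keys = [f"key{i}" for i in range(40)]
landed = [index.insert(k) for k in keys]
```

Because buckets never shrink in this sketch, a lookup can stop at the first bucket that is not full, which keeps most searches to a single hash probe.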
Sample data set
Index
k
ko
kor
kore
korea
korean
Keyword dids
ko #1, #2
korea #1, #3
korean #4
dids Keyword
#1, #2, #3, #4 k
#1, #2, #3, #4 ko
#1, #3, #4 kor
#1, #3, #4 kore
#1, #3, #4 korea
#4 korean
#1 : ko, korea #2 : ko #3 : korea #4 : korean
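OzSearch / General prefix-match structure
[Chart: index size vs. document volume, OzSearch compared with ordinary engines]
[Chart: pointer count and did count, general structure vs. OzSearch]
                  General structure   OzSearch
Index count       6                   6
Keyword count     6                   3
Did count         18                  5
Pointer count     12                  9
Operation load    Medium              Low
Remarks           -                   -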
• Small index DB → lower operation load → faster search
• The larger the data, the more resources are saved
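The did counts in the comparison can be reproduced with a small sketch built on the sample data set. The function names and the prefix-to-keyword layout are assumptions made for illustration; the slides do not say how OzSearch lays out its pointers, so the pointer counts are not modelled here.

```python
from collections import defaultdict

# Sample data from the slide: document id -> keywords it contains
DOCS = {1: ["ko", "korea"], 2: ["ko"], 3: ["korea"], 4: ["korean"]}
LONGEST = "korean"
PREFIXES = [LONGEST[:i] for i in range(1, len(LONGEST) + 1)]  # k, ko, ..., korean


def build_general(docs):
    """General prefix-match structure: every prefix stores its own did list."""
    index = defaultdict(set)
    for did, keywords in docs.items():
        for kw in keywords:
            for prefix in PREFIXES:
                if kw.startswith(prefix):
                    index[prefix].add(did)
    return index


def build_shared(docs):
    """Shared structure: dids are stored once per keyword; prefixes only point at keywords."""
    postings = defaultdict(set)
    for did, keywords in docs.items():
        for kw in keywords:
            postings[kw].add(did)
    prefix_to_keywords = {
        p: sorted(kw for kw in postings if kw.startswith(p)) for p in PREFIXES
    }
    return prefix_to_keywords, postings


def search_shared(prefix, prefix_to_keywords, postings):
    """Resolve a prefix by following keyword pointers and merging their did lists."""
    dids = set()
    for kw in prefix_to_keywords.get(prefix, []):
        dids |= postings[kw]
    return dids


general = build_general(DOCS)
prefix_to_keywords, postings = build_shared(DOCS)
general_dids = sum(len(dids) for dids in general.values())
shared_dids = sum(len(dids) for dids in postings.values())
print(general_dids, shared_dids)
```

Both structures return the same documents for every prefix; the shared one simply avoids repeating the did lists under each prefix.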
Resource saving
Shared index Structure
Utilizing Low Cost Functions
1) Macro: measure processing time for every module → time-delay analysis
2) Micro: check performance of every library / function
1) Macro analysis: using Google performance tools
• Check processing time for every module (CPU profiler)
• Delayed modules → logic improvement or micro analysis
Sample data: Wikipedia
2) Micro analysis: atoi() function example
• 1 billion ASCII-to-integer conversions
• atoi() function: about 30 sec
• New code: 0.3 to 3 sec
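The macro step (time every module, then drill into the slow ones) can be sketched as below. The slides name Google performance tools as the profiler; this sketch substitutes Python's `time.perf_counter`, and the `parse` and `build_index` modules are hypothetical placeholders.

```python
import time
from contextlib import contextmanager

timings = {}  # module name -> accumulated seconds


@contextmanager
def timed(module_name):
    """Accumulate wall-clock time spent inside the with-block under module_name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[module_name] = timings.get(module_name, 0.0) + elapsed


def parse(lines):
    # Hypothetical module: ASCII-to-integer conversion
    return [int(text) for text in lines]


def build_index(values):
    # Hypothetical module: deduplicate and sort
    return sorted(set(values))


lines = [str(i % 1000) for i in range(100_000)]
with timed("parse"):
    values = parse(lines)
with timed("build_index"):
    index = build_index(values)

# The slowest module is the candidate for logic improvement or micro analysis.
slowest = max(timings, key=timings.get)
print(slowest, timings)
```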
Memory data reduction technology
SNS,
Internet
……
DBMS,
File
Documents
sensor
Standard
import file
Standard
Input file
Bit divide
Inverted
File create
01010011
00110011
Byte
encoder
Column wise
Bit
grouping
Re-position
Code Temp
encoder
Formatter
0101
0011 x 3
acde001
defg002
fghi003
… 
1,2,3,…
acde001
defg002
fghi003
…
Output
0011
0011
1100
1100
……
- Memory (data type simplification + byte sharing + slimmed data structure)
- Disk I/O (usage-frequency grouping + data reduction)
OP
Analysis
[Chart: large-volume data operation algorithm (wanted-vehicle lookup, 250 million records/day), general algorithm vs. OzSearch; data labels as extracted: 40200, 600]
Ex(1)
Wanted car surveillance CCTV data: 0.25 bil images/day
Intentional changes: 1 → 4, 3 → 8, 마 → 머, …
Require real time search
Minimum
Comparison
3 digit misrecognition 7C1 + 7C2 + 7C3 = 7 + 21 + 105 = 133 + right recog. 1 time = 134
Algorithm
General algorithm: 3000 images/s * 134 cases = 402000 tps
OzSearch algorithm: 3000 images/s * 0.2 s/image = 600 tps
Proof Recognition failure not counted
1. Word correction algorithm
2. Character comparison algorithm to find similar trademarks
Operation Algorithm and inverted file
Ex(2)
KR trademark search system
Inverted file creation time for about 5 million trademarks
                         Current mechanism with data mining                          OzMarker
Algorithm                3.2 bil * 0.00003 sec/case = 26.7 hr                        5 mil * 0.00003 sec/case = 150 s
Inverted file creation   5 mil * (38 = 8 digit * 3 similar char) = 3.2 bil indices   5 mil indices
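The creation-time figures follow directly from the index counts and the per-case cost quoted above. A small check, using only the slide's numbers (the function and constant names are illustrative):

```python
SECONDS_PER_CASE = 0.00003  # per-index cost quoted on the slide


def creation_time_seconds(index_count):
    """Inverted file creation time for a given number of index entries."""
    return index_count * SECONDS_PER_CASE


conventional = creation_time_seconds(3_200_000_000)  # 3.2 billion indices
ozmarker = creation_time_seconds(5_000_000)          # 5 million indices

print(f"conventional: {conventional / 3600:.1f} hr")
print(f"OzMarker: {ozmarker:.0f} s")
```

Since the per-case cost is the same on both sides, the whole speed-up comes from creating fewer index entries.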
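[Chart: Inverted file size & capacity comparison, competitor vs. BeCurio. Categories: YES24 (index size, GB), Shopping mall 1 (index size, GB), sentence search (time, minutes), quote-based trading (throughput, relative value). Data labels as extracted: 20, 50, 30, 11.7, 3.2, 5, 100]
[Chart: Big data inverted-file size comparison (example: 1 billion stock trades across 1 million accounts), required space in GB: row-wise 31, inverted file 50, OzSearch 4.5]
[Chart: Index count comparison (index data for 2.3 million similar trademarks): conventional indexing 3.2 billion, OzSearch algorithm 5 million]
[Chart: Big data operation algorithm (wanted-vehicle lookup, 250 million records/day), tps: general algorithm 402000, OzSearch 600]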
Index size, Searching Time comparison
(BMT results)
Memory Reduction Example
1) Row-wise DB structure ≒ 31GB
2) Basic inverted data structure ≒ 50GB
* As cases/columns increase, more storage space is required
3) Memory reduction data structure ≒ 4.5GB
* As cases/columns increase, efficiency also increases
[Chart: index size (GB), required space: row-wise 31, column-wise index 50, optimized 4.5]
Name SSN ACC #
…
…
…
1 million * (20 bytes + 13 bytes + 20 bytes) = 53MB
ACC # Designated
Code
Mass trx y/n?
…
…
…
1 billion * (20 bytes + 10 bytes + 1 byte) = 31GB
~~~
Name SSN ACC #
…
…
…
1 million * (20 bytes + 13 bytes + 20 bytes) = 53MB
Mass trx 0 ACC # …… ACC #ACC #
Mass trx 1 ACC # ACC #ACC # ……
ACC #
Designated
Code
…
…
…
~~~
(20 bytes) * 1 billion = 20GB
1 billion * (20 bytes + 10 bytes) = 30GB
Original Data
53MB
TRX data
≒ 4.4GB
Example
Case: 1 billion transactions from 1 million accounts at 10,000 branches,
indexing SSN, account #, name, and mass-transaction flag (64-bit OS)
Original Data
Memory Reduction:
Data Structure and Capacity
Column SSD(Offset)
Original:
  Name + SSN     1M       33 bytes
  Account #      1M       20 bytes (64-bit hash indexing)
  Designated #   10,000   10 bytes
  Total: 53MB
TRX:
  Mass TRX       1B       1 bit                                               125MB
  TRX            1B       20 bits (1M ACC) + 14 bits (10,000 Designated #)    4.25GB
Memory reduction data ST and Size
Mass TRX 1B bits 0011010101000…… 110101010001101010
ACC # 1M 20 bytes …..
Designated # 10,000(10 bytes) …
…..
TRX data(10B) Abs ACC 20 bits(1M) Abs code 14 bits(10,000) ……
Analysis/OP format
53MB
125MB
4.25GB
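The sizes above can be re-derived by replacing each account number and designated code with a small integer offset and packing both into 34 bits per transaction. The sketch below only reproduces that arithmetic; the function names and the exact bit layout are assumptions, not the OzAiM format.

```python
ACCOUNTS = 1_000_000
DESIGNATED_CODES = 10_000
TRANSACTIONS = 1_000_000_000


def bits_needed(distinct_values):
    """Bits required to address distinct_values items by offset."""
    return (distinct_values - 1).bit_length()


acc_bits = bits_needed(ACCOUNTS)            # offset into the account table
code_bits = bits_needed(DESIGNATED_CODES)   # offset into the designated-code table

row_wise_bytes = TRANSACTIONS * (20 + 10 + 1)                     # raw row-wise record
packed_trx_bytes = TRANSACTIONS * (acc_bits + code_bits) // 8     # packed offsets
mass_flag_bytes = TRANSACTIONS // 8                               # 1 bit per transaction


def pack(acc_offset, code_offset):
    """Pack one transaction's two offsets into a single integer (assumed layout)."""
    return (acc_offset << code_bits) | code_offset


def unpack(value):
    return value >> code_bits, value & ((1 << code_bits) - 1)


print(acc_bits, code_bits)
print(row_wise_bytes / 1e9, packed_trx_bytes / 1e9, mass_flag_bytes / 1e6)
```

The two packed components together come to about 4.4GB, against 31GB for the row-wise layout.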
Memory Reduction Developed
Improving
Learning func. For analysis
Data Type reduction Byte sep/share
Super light inverted data Structure
Standard user data definition API
DISK I/O reduction
Data Type
reduction
Column wise
compression
Server/Index
distributed/parallel processing
Essential algorithm for each part
I/O
Super light index DB structure
Map reduce / comm. tech
Memory Processing
Distribution operation
Parallel Processing
I/O
DISK I/O reduction
Reduce Communication
between servers
Big data analysis (NoSQL type)
More than hundred libraries
Big Data in Memory Technology
50% resource saving
search engine technology
Developing
Generalized Unstructured
mass data processing
Query
Optimizer
Core Technology
1) Search Engine indexing Structure and Algorithm
2) String Management related data structure and algorithm
3) Memory / resource efficiency enhancement library
4) Big Data in Memory related technology (based on Search Engine Technology)
Status Quo of BeCurio Technology
Product Explanation
Search
Engines
(OzSearch)
Keyword
Search
Basic
Memory/disk-based resource-sharing keyword search engine
More than 50% index size reduction
OzParser
Integrated phoneme analyzer
Just for search engine
Instant Search
Memory based
OzKsana
Real-time keyword recommendation for each character input
Real-time customized ranking/indexing based on group
Compare with Google instant search
OzSniper
AND search for each character, phoneme Analyzer
Powerful spelling correction
Text block Search OzDns
Real time web based text block search
Super fast and light location data index structure
No preprocessing
Algorithms
Customized Search RERE
Real time super fast ranking for each character input
Based on keyword chain patent registration technology
Typo Correction OzFix
resource saving more than 100 times
Optimizing accuracy and flexibility
Compare with Google
Super fast search Algorithm
Auto completion and expansion for every keyword
Dramatic speed improvement and resource reduction logic
Solutions
Big Data in Memory OzAiM
Memory-reduction data structure
ANSI query (NoSql type) Analysis
Real time Group by/Order by
for more than 10B data
Similar Trade mark Search OzMarker
Based on OzFix algorithm
Super fast indexing, enhanced accuracy and flexibility
6M trademark indexing
• 24 hr → 100 sec
Prior Patent Search Crony
Avoiding search formula by experts
Avoiding existing similarity analysis algorithm
1/10 resource + 10 times
faster speed
Plagiarism Checker OzSoS
Plagiarism checker based on DnS
Super fast, high accuracy real time text block similarity
analysis
BeCurio Products
An Example of Patent Search Formula
( (web* or internet* or network*) and (brows*) and (HTML* or HTTP* or XML* or Markup*
or javascript*) ) OR (((remote* and naviga*) or (spatial adj10 naviga*) or (arrow and (key
adj10 naviga*)) or (directional and (key adj10 naviga*)) or (user adj10 interface and
naviga*)) and (brows* or menu) ) OR ((((리모컨 or 리모콘) or (화살* and 키) or (방향 and
키) or (유저 adj10 인터페이스)) and 화면 and 선택) ) OR ( (((web) and (client* or browser*)
and ((remote* adj10 control*) or (cursor adj10 navigation) or layout) or (web* or internet*
or network*) and (brows* or navigat*) and ((remote* adj10 control*) or (user adj interfac*)
or layout*))) ) OR ( (gui or presentation) and engine and (XML* or script* or Java*) ) OR
( (web and application and framework) or (web and application and platform) or (web and
rich and internet and application) or (web and ria) or (web and ajax) or (web and
asynchronous and javascript and xml) or (widget and web) or (gadget and web) or (rss and
web) or (really adj3 simple adj3 syndication adj3 web) or (web and ((smart and client) or
(smart and agent))) or (web and downloadable adj10 application) or XAML or XUL or MXML
or (interface and element and web) ) OR ( (웹 and *플리케이션 and 프레임*) or (웹 and *플
리케이션 and 플랫폼) or RIA or AJAX or *이젝스 or *이잭스 or 아작스 or 위짓 or 위젯 or
widget or 가젯 or 가짓 or gadget or RSS or (웹 and 맞춤형정보배달) or (웹 and 스마트 and
클라이언트) or (웹 and 스마트 and 에이전트) or (웹 and 스마트 and 에이젼트) or (웹 and 다
운* and *플리케이션) or XAML or XUL or MXML ) OR ( (CE or (TV or television) or DTV or
(digital adj2 (TV or television))) and (service or (web adj10 service) or (mash adj up adj10
service)) ) OR ( (Opera or Yahoo or Konfabulator or Google or Microsoft or ANT or Mozilla or
Netscape or MacroMedia or IBM or HP).AP. )
Crony: New Patent Search
Keyword search Document Search
Technology
Instant
Search
Text block
Content &
Location
Prior
Patent
Search
Text block
location search
CRONY
Accurate
Easy
Economic
Speedy
Content and its Location
No missing case
No need for any formula
Minimum Resource
Fast Creation of Indexing DB
Real Time DB updating
Crony: Expected Values
Crony System
Crony vs. Keyword Search
Legacy Patent Search
Keyword
Search Engine
Search formula
Keyword index
DnS Text
block Search
Patent DB
Keyword + location
Auto block separation
Patent DB
Text block search index DB
Keyword
index DB
Keyword Search
result
Content + Location search
result
User
within seconds
by ordinary user
Days or weeks
Only by expert
Preprocessing Sentence/Paragraph
Hash Fn creation
Similarity
Analysis
Finger Printing
Cosθ
Time delay
Accuracy issue
Sizable index DB
Tweaked Sentence
issue
Similarity
Analysis
# of identical KWD,
Density, sequence, etc
No Pre-Processing
Processing role speed
Text block
search filter
Extracting the target for
refined analysis
Generic index DB search
Within a second
Extracting
thousands cases
Location data
TB filter
TB automatic separation
Identical keywords,
Distance analysis
Within a second
Extracting
hundreds cases
Refined
analysis
engine
Text block analysis
- Frequency, distance,
sequence
- Location calculation
Within seconds
Min. H/W
Assumption: 10 page 3M
8GB MM, more than 2CPU, PC server
Feature
Min additional job(verification DB creation,
etc) short term development and launch
Millions of
Patent data
1000 Refined analysis
engine
similarity
result
Minimize
operation
load
Rough
Text block
operation
Refined
Text block
Operation
Text block
operation
filter
DataQuantity
Text block
search Filter
1. Innovative structure by location based keyword analysis
2. Super fast and highly accurate similarity checking with 1/10 of
resource
Speed and Accuracy
• Similar keyword detection
for every keyword
• Easy to use
• User registration of similar
keywords
• Sg/Pl conversion
• tense conversion
• part of speech
conversion
Query Expansion
1. Detecting intentional search avoidance
2. Automatic Query expansion for similarity analysis
Typo correction Root Keyword
Search
• Intentional typo
• correcting multiple words
with no spacing
• Powerful correction
algorithm
Other features
• Detect change in word
sequence
• Detect spacing change
• Detect partial change in
words
Sentence structure
Text block Indexing
1. No pre/post processing
2. No fingerprinting or other existing indexing methods
3. Location data indexing for every keyword
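Location data indexing for every keyword can be illustrated with a positional inverted index: each keyword maps to (document, word position) pairs, and a text block matches where all query words fall close together. This is a simplified sketch under assumed names (`build_location_index`, `search_block`, `max_gap`); it is not the Crony or OzDnS algorithm.

```python
from collections import defaultdict


def build_location_index(docs):
    """Map each keyword to a list of (doc_id, word_position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return index


def search_block(index, query, max_gap=2):
    """Return {doc_id: (start, end)} where all query words occur with gaps <= max_gap."""
    words = set(query.lower().split())
    hits = defaultdict(list)                      # doc_id -> [(position, word)]
    for word in words:
        for doc_id, pos in index.get(word, []):
            hits[doc_id].append((pos, word))

    results = {}
    for doc_id, occurrences in hits.items():
        occurrences.sort()
        cluster = [occurrences[0]]
        for item in occurrences[1:] + [None]:     # None closes the last cluster
            if item is not None and item[0] - cluster[-1][0] <= max_gap:
                cluster.append(item)
                continue
            if {w for _, w in cluster} == words:  # cluster covers every query word
                results[doc_id] = (cluster[0][0], cluster[-1][0])
                break
            if item is not None:
                cluster = [item]
    return results


docs = {
    1: "a search engine finds the text block location quickly",
    2: "text search is slow and the block is far away in another location",
}
location_index = build_location_index(docs)
matches = search_block(location_index, "text block location")
print(matches)
```

Document 2 contains all three query words but they are spread too far apart, so only document 1 is reported, together with the word positions of the matching block.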
Detecting intentional avoidance
Target data1
Prior patent detection
2
3
Instant Search Engine and others
4
Current Crony System Coverage
5
Unique functions and differentiated service
1. High speed text block search
2. Words sequence check
1. US Patent 1976~ : 3.8 million samples
2. KR Patent
1. Multiple words typo correction (better than Google)
2. Similar query expansion
3. Root word search
1. Customizable coverage and accuracy control
2. Instant search for meta data
3. Saving function for the content of interest
Text block chained search
By right mouse click
Chained Search of Similar Patents
Crony: unique text block search
Real time web based
No preprocessing
Crony vs. eTBlast by Virginia Tech
OzDnS : eTBlast
X. Defining user-customized search conditions
Variable              Role                                 Effect
Disparity             Distance between keywords            → Accuracy and number of search results
Min text block size   Minimum # of words in text block     → Controls # of search results
Keyword weight        Frequency and boundary               → Ranking adjustment
Max text block size   Maximum # of words in text           → Visual representation
Word gap              Text block separation                → Accuracy, # of search results
Word order            Keyword order accordance             → Accuracy
Variable Controls
Applicable areas with Crony Search
• Patent Search/ Judgment Search/ Plagiarism Checker/ Quotation
Search/ eBook Content Search, etc
• Smart Contact Center
• automated text message feedback for the known questions
• Script Search
• Jumping to the video frame matching to the script line
• Removing repeated questions
• By instantly showing the similar questions to the character input
• Data Mining Search
• Data Mining with Search Interface (Anyone can do mining)
• Hyper-Knowledge Product: Sharing Knowledge with No Effort
→ KM system, EDMS, documents, etc.
Innovative system
• Personal search pattern and storing → increase core knowledge sharing
• Innovative knowledge reference / sharing model
• Super-fast instant search
• Real-time web-based text block search
Creating value added
knowledge network
by quality knowledge
acquisition and sharing
Creating Knowledge eco-system
Hyper Knowledge Creation Model
i. Knowledge branch creation model
ii. Core contents chain → core content sharing
iii. Knowledge eco-system by specialized category
Hyper Knowledge
Current Document/Knowledge
management system
• Document life cycle management
• Document search by keyword
• Knowledge registration oriented
• Low Document utilization (Too many results)
Knowledge branching and sharing
system
• Increasing knowledge utilization by increasing
knowledge sharing
• Search and share core knowledge text block
• voluntary knowledge sharing
• Knowledge search based on text block similarity search
KMS EDMS CMS ERP… etc
Life Cycle mngt. Keyword search Document based
Not able to identify the content of
interest automatically
Search knowledge
and its location
Knowledge
branch
creation
Sharing
knowledge by
specialized
categories
Individual
saving text
block of
interest
Creating high
quality
knowledge
Legacy systems
Document mngt. sys
Hyper Knowledge
System
Individual
Knowledge
branch
Shared
Knowledge
network
Process of Knowledge Network Creation
Shared node
report(1)
Key Paragraph-1 + wikipedia/original link
Key Paragraph-21 + PCM /original link
Key Paragraph-31 + report/ link
Input Keyword : knowledge network
Date : 2013. 2. 15
Reference Docs : wikipedia / 15 paragraph
results
attachment
wikipedia
PCM…
DnS
TBS
Wikipedia
KOI
TBS
PCM
TBS
DnS
Instant
search
report
TBS
DnS
“knowledge
network”
Key Paragraph-51 + shared node Link
Knowledge
i. Individual knowledge search and management
ii. Personal core knowledge chain creation
iii. Automatic Knowledge network creation by specialized categories
Enhancing Knowledge creation
10 core
paragraphs
10 core
paragraphs
10 core
paragraphs
Infinite Knowledge Network
(10 documents 10 core knowledge)
Theoretical knowledge combination
10 * 10 * …… * 10 = 10^10 ≒ ∞
…
…
…
Knowledge sharing
Save
Key Paragraph-21
Key Paragraph-1
Key Paragraph-31
Common Interest
Core Knowledge Sharing Map
Title : Knowledge block?
Author : BeCurio Research Center
sentences………………………….
……………………………………….
……………………………………….
……………………………………….
……………………………………….
……………………………………….
……………………………………….
……………………………………….
……………………………………….
……………………………………….
……………………………………….
……………………………………….
……………………………………….
………………………………………..
Key Paragraph -1 ----------------------------
-----------------------------------------------------------
Key Paragraph -2 ----------------------------
-----------------------------------------------------------
Key Paragraph -3 ---------------------------
-----------------------------------------------------------
------------------------------------
Key Paragraph-1 + wikipedia/original link
Key Paragraph-21 + PCM/original link
Key Paragraph-31 + report/link
Input Keyword : Samsung
Date : 2013. 2. 15
Reference Docs : wikipedia / 15 paragraph
Query + source + key sentences + shared node link + …
report(1)
KOI drag
…(2)
KOI drag
…(3)
KOI drag
Post Docs.
Prior Docs.
sharing node table
Kwd/TB Search
• doc name,
• text block location
ref. freq (line thickness)
• post doc name,
• original link
Sharing node table creation → sharing node link
Shared map creation
i. Extracting Core knowledge from huge documents
ii. Sharing by core text block similarity
iii. Creating knowledge links by specialized categories
Core Knowledge Sharing
K
box
Growing Knowledge Eco-System
Individual chain
Hyper Knowledge Creation → Growing knowledge eco-system
Knowledge network creation process: Individual search → knowledge saving → share node creation → knowledge network creation → growing knowledge eco-system
Search technology: Instant search → TBS for text body → Chain TBS for knowledge
Effect: dramatically improving knowledge sharing, increasing core knowledge acquisition opportunity and saving time → high-level knowledge creation
Category
Key Paragraph-1
Key Paragraph-21
share node
KP#12
KP#13
KP#22
KP#23
KP#31
KP#32
KP#311
KP#312
KP#n1
KP#n2
report
doc
www
K. network
.
.
.
.
.
.
.
.
.
.
.
.
expansion
…
…
…
paper
Instant Search with 3 No's
Keyword
+ Waiting Search Result
or No Result
Navigation
or re-search
Current Search Engine
Dramatic Improvement of Speed, Quality and Ease of Use
No enter key
No Waiting
No Zero Result
Keyword
Search result
(keyword)
no result ?
Instant Correction for No Result
Typo/No result Keyword input
New Search
Result provided by each input key stroke
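The "result for each key stroke, never a zero result" behaviour can be sketched as a prefix lookup with a fuzzy fallback. The vocabulary and function name are hypothetical, and `difflib.get_close_matches` from the Python standard library stands in for the OzFix correction algorithm, whose details the slides do not give.

```python
import bisect
import difflib

# Hypothetical vocabulary; a real engine would use its keyword index.
VOCAB = sorted(["ko", "korea", "korean", "memory", "patent", "search"])


def instant_search(typed, limit=5):
    """Return keywords starting with `typed`; fall back to close matches if none."""
    start = bisect.bisect_left(VOCAB, typed)     # first word >= typed in sorted order
    hits = []
    for word in VOCAB[start:]:
        if not word.startswith(typed):
            break
        hits.append(word)
        if len(hits) == limit:
            break
    if hits:
        return hits
    # No prefix match: treat the input as a possible typo instead of returning nothing.
    return difflib.get_close_matches(typed, VOCAB, n=limit, cutoff=0.6)


# Simulate typing one character at a time; every key stroke yields a result list.
for end in range(1, len("korean") + 1):
    print("korean"[:end], instant_search("korean"[:end]))

corrected = instant_search("koeran")             # transposed letters, no prefix match
print(corrected)
```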
Purchase Department Instant Search Stock Management
Category for individual purchase
person
(IP / log in ID)
Raw Material information for
MEMORY
Current Stock information
Basic product info., related
company info.
• Instant Search
Text block Drag & search
Super-fast text block content and
location search
Based on dramatically improved
Search speed,
a new algorithm
for text block search applied
Provide super-fast and customized search for each character input
to more than 10,000 departments in S group
Supplier
(domestic/foreign)
In-stock or
supply
information
Character Input
“M E M O R Y”
Customized Instant Search by Department
• Instant Search
Recommend relevant item for
each character input
Attachment, manual, product detail, content,
etc
Text block Search
• Text block chain search
(non)login → Search → Cart → Purchase
Recommending Target Books
Managing User/Group Behavior Pattern
Personal
Pattern
Real time
recommendation
Recommend
Filter
Group
Pattern
Steve Jobs
Search
- Goods Attributes
- MD managing
points
- Target DB
Book Recommendation Service RERE
category
author
Recommendation
Accuracy
(including MD )
event
Search
Pattern
Recommend
Filter
Recommend
Filter
Real time
Behavior log
CRM/log
DB
Product
property
Utilizing input keyword, click data and purchase history
10. 6. am 07: 00
Data propagation → presentation… book purchase / iPhone 4S order
Real time pattern
am 07: 10
Category/author/event score
curve
am 08: 30
Real time recommendation
MD manual
recommendation
Learning
Similar Trademark Search OzMarker
Example
KR trademark search system
Inverted file creation time for about 5 million trademarks
                         Current mechanism with data mining                          OzMarker
Algorithm                3.2 bil * 0.00003 sec/case = 26.7 hr                        5 mil * 0.00003 sec/case = 150 s
Inverted file creation   5 mil * (38 = 8 digit * 3 similar char) = 3.2 bil indices   5 mil indices
                         Current                                            OzMarker                                                  Note
Processing mechanism     Similar character indexing                         Typo correction algorithm                                 Better than Google typo correction algorithm
Indexing time and size   About 24 hr, 10 GB                                 About 100 sec, 500 MB                                     KR trademark 5 million data
Accuracy                 Depends on similarity definition                   Similar character, similarly pronounced words algorithm   Independent of languages
Ease of use              Delay of new trademark registration                Registration of new trademark within a few seconds
Expandability            Applying a new pattern requires overall indexing   No overall indexing                                       Independent of languages
Big data in Memory solution, AiM
1. Basic Fn(Ansi Query)
2. User Defined Fn
3. Statistics Fn
4. Other data analysis
Fn
User
Defined
Basic
Fn
+ =
- > <
Data analysis tool
Add
New pattern
High level memory utilization
Efficient memory use
Super light Search
Engine Tech
Massive data analysis part
Search engine part
AiM
Existing Search Engine Tech
General Solution
Meet user requirements such as data
analysis speed, analysis tool and
statistical methods
Big Data
Structure and Algorithm
Fast and convenient
akvalex183 views
Ten things to consider for interactive analytics on write once workloads by Abinasha Karana
Ten things to consider for interactive analytics on write once workloadsTen things to consider for interactive analytics on write once workloads
Ten things to consider for interactive analytics on write once workloads
Abinasha Karana1.9K views

Recently uploaded

Why and How CloudStack at weSystems - Stephan Bienek - weSystems by
Why and How CloudStack at weSystems - Stephan Bienek - weSystemsWhy and How CloudStack at weSystems - Stephan Bienek - weSystems
Why and How CloudStack at weSystems - Stephan Bienek - weSystemsShapeBlue
197 views13 slides
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue by
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlueMigrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlueShapeBlue
176 views20 slides
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P... by
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...ShapeBlue
154 views62 slides
Cencora Executive Symposium by
Cencora Executive SymposiumCencora Executive Symposium
Cencora Executive Symposiummarketingcommunicati21
139 views14 slides
Uni Systems for Power Platform.pptx by
Uni Systems for Power Platform.pptxUni Systems for Power Platform.pptx
Uni Systems for Power Platform.pptxUni Systems S.M.S.A.
61 views21 slides
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ... by
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...ShapeBlue
144 views12 slides

Recently uploaded(20)

Why and How CloudStack at weSystems - Stephan Bienek - weSystems by ShapeBlue
Why and How CloudStack at weSystems - Stephan Bienek - weSystemsWhy and How CloudStack at weSystems - Stephan Bienek - weSystems
Why and How CloudStack at weSystems - Stephan Bienek - weSystems
ShapeBlue197 views
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue by ShapeBlue
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlueMigrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue
ShapeBlue176 views
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P... by ShapeBlue
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
ShapeBlue154 views
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ... by ShapeBlue
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
ShapeBlue144 views
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue by ShapeBlue
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlueElevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue
ShapeBlue179 views
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O... by ShapeBlue
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
ShapeBlue88 views
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ... by ShapeBlue
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
ShapeBlue79 views
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ... by ShapeBlue
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
ShapeBlue123 views
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti... by ShapeBlue
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
ShapeBlue98 views
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson156 views
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive by Network Automation Forum
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLiveAutomating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue by ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
ShapeBlue103 views
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software385 views
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue by ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
ShapeBlue94 views
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or... by ShapeBlue
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
ShapeBlue158 views

Oz search

  • 2. eBook, eJournal, Paper, Patent, Judgment Keyword Search Motivation for New Search Engine Need to read through the document to find the passage of interest
  • 3. Limitations of the Keyword Search. Existing engines: identical structure and algorithm, no differentiated value. OZ Search: innovative structure and algorithm, a totally new search (new-concept search service). Limitations of keyword searches: delayed indexing, inefficient resource use, high resource cost, back-end pre/post processing.
  • 4. Index DB Structure Architecture of OZ Search Memory based design Resources optimizing Indexing & searching speed Index Structure Search and Index Algorithms Shared index DB Multi-level Hashing Bucket slots Low Cost FnByte sharing Bit type format algorithms Block Sorting Memory Optimizing Word Pool Hash index Expansion for key stroke Typo Correction Auto-completion for every keyword Ranking for key strokeindex Inverted Data structure + OzKsana Instant OzBasic Enter OzDnS Text block Instant Search OzMarker Brand OzAim Big Data in memory Search Engines Applications Products Similarity Analyzer Crony Patent + . . . . . .
  • 5. Frequency rate Precision rate No Trade Off Resource Sharing Minimizing Duplication Shared Index DB Memory based Keyword location mngt OzSearch B+ tree General Index ST + index & keywords Slim Engine Through Shared Index structure,. Saving resource by more than 50% and guaranteeing memory based search for big data Trie Frequency rate Precision rate Trade Off Index Structure of OzSearch
  • 6. Memory based Index DB: employing a multi-level hash index (how to treat collisions and sorting). Multi Level Hash Index (conceptual diagram): bucket blocks whose buckets are sized by prime #1 (a~, b~, …, ㄱ~, ㄴ~, …) hold sorted data; an overflowing bucket (b~) spills into next-level bucket blocks whose buckets are sized by prime #2. Sort blocks (aa~, ba~, …, 가~, 나~, …) map to the data set (Data(a), Data(b), …) in sorted data slots. Slot data sort: data → hash function → sort data block creation → data allocated to the corresponding block → Shell sort → sequential merging of sort blocks. Hash collision handling: data → hash function (prime #1) → bucket allocation → if bucket(n) overflows → hash function (prime #2) → next-level bucket(n) creation.
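The collision-handling flow on slide 6 can be sketched roughly as follows. This is a minimal illustration and not the OzSearch implementation: the prime values, the bucket capacity, the polynomial string hash, and the names `MultiLevelHashIndex`, `insert`, and `contains` are all assumptions, and Python's built-in sort stands in for the Shell sort named on the slide.

```python
from typing import Dict, List

# Hypothetical parameters: the slide names "prime #1" and "prime #2" without values.
LEVEL_PRIMES = [101, 211]
BUCKET_CAPACITY = 4


class MultiLevelHashIndex:
    """Toy multi-level hash index: a full level-1 bucket spills into level 2."""

    def __init__(self, primes=LEVEL_PRIMES, capacity=BUCKET_CAPACITY):
        self.primes = list(primes)
        self.capacity = capacity
        # One table per level: bucket number -> sorted list of keys.
        self.levels: List[Dict[int, List[str]]] = [{} for _ in self.primes]
        self.overflow: List[str] = []  # keys that fit in no level

    @staticmethod
    def _hash(key: str, prime: int) -> int:
        # Simple polynomial string hash reduced modulo the level's prime (assumed).
        h = 0
        for ch in key:
            h = (h * 31 + ord(ch)) % prime
        return h

    def insert(self, key: str) -> None:
        for level, prime in enumerate(self.primes):
            bucket = self.levels[level].setdefault(self._hash(key, prime), [])
            if key in bucket:
                return  # already indexed
            if len(bucket) < self.capacity:
                bucket.append(key)
                bucket.sort()  # keep slots sorted (stand-in for Shell sort)
                return
            # Bucket overflow: fall through to the next level's hash function.
        if key not in self.overflow:
            self.overflow.append(key)

    def contains(self, key: str) -> bool:
        for level, prime in enumerate(self.primes):
            if key in self.levels[level].get(self._hash(key, prime), []):
                return True
        return key in self.overflow


index = MultiLevelHashIndex()
for n in range(500):
    index.insert(f"word{n}")
print(index.contains("word42"), index.contains("missing"))
```

With 500 keys and only 101 × 4 first-level slots, some keys necessarily land in the second level, which is the overflow path the slide describes.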
  • 7. Resource saving: shared index structure. Sample data set: #1: ko, korea; #2: ko; #3: korea; #4: korean. OzSearch structure: the index entries k, ko, kor, kore, korea, korean point to the keywords ko, korea, korean, whose dids are ko: #1, #2; korea: #1, #3; korean: #4. Ordinary prefix-match structure: every entry keeps its own dids, k: #1, #2, #3, #4; ko: #1, #2, #3, #4; kor: #1, #3, #4; kore: #1, #3, #4; korea: #1, #3, #4; korean: #4. Comparison (ordinary structure vs. OzSearch): index entries 6 vs. 6; keywords 6 vs. 3; dids 18 vs. 5; pointers 12 vs. 9; operation load medium vs. low. Chart: index size against document volume, OzSearch vs. ordinary engines. • Small index DB → decreased operation load → speedy search • The bigger the data size, the more resources can be saved
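The did counts on slide 7 can be reproduced with a small sketch. It assumes a plain dictionary layout for both structures; the function names (`build_ordinary`, `build_shared`, `search_shared`) are illustrative and not taken from OzSearch, and the pointer counts on the slide depend on internal layout that the slide does not spell out, so they are not modelled here.

```python
from collections import defaultdict

# Sample data set from the slide: document id -> keywords.
docs = {1: ["ko", "korea"], 2: ["ko"], 3: ["korea"], 4: ["korean"]}


def prefixes(word):
    """All leading substrings of a word: 'ko' -> ['k', 'ko']."""
    return [word[:i] for i in range(1, len(word) + 1)]


def build_ordinary(docs):
    """Ordinary prefix-match index: every prefix stores its own did list."""
    index = defaultdict(set)
    for did, words in docs.items():
        for word in words:
            for prefix in prefixes(word):
                index[prefix].add(did)
    return index


def build_shared(docs):
    """Shared index: prefixes point to keywords, and only keywords store dids."""
    postings = defaultdict(set)
    for did, words in docs.items():
        for word in words:
            postings[word].add(did)
    prefix_to_keywords = defaultdict(set)
    for word in postings:
        for prefix in prefixes(word):
            prefix_to_keywords[prefix].add(word)
    return prefix_to_keywords, postings


def search_shared(prefix, prefix_to_keywords, postings):
    """Resolve a prefix through its keywords to a sorted did list."""
    result = set()
    for keyword in prefix_to_keywords.get(prefix, ()):
        result |= postings[keyword]
    return sorted(result)


ordinary = build_ordinary(docs)
prefix_to_keywords, postings = build_shared(docs)
ordinary_dids = sum(len(dids) for dids in ordinary.values())
shared_dids = sum(len(dids) for dids in postings.values())
print(ordinary_dids, shared_dids)
print(search_shared("kor", prefix_to_keywords, postings))
```

Both structures answer the same prefix queries; the shared layout stores each did once per keyword instead of once per prefix, which is where the 18 vs. 5 figure comes from.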
  • 8. Utilizing Low Cost Functions 1) macro : processing time measurement for every module  time delay analysis 2) micro : performance check for every library / function 1) Macro analysis : google performance tool use  processing time check(CPU profiler) for every module  delayed modules  logic improvement or micro analysis Sample data: Wikipedia 2) Micro analysis : atoi() function ex.  1 bil ascii to integer conversion  atoi() function: about 30 sec  new code: within 0.3 ~ 3 sec
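The micro analysis step on slide 8 (timing a single library function against a replacement) can be sketched as below. This is only an illustration of the measurement approach, written in Python rather than the C that `atoi()` implies; `parse_digits` and `time_conversion` are hypothetical names, and the sketch makes no claim about which variant is faster or about the timings quoted on the slide.

```python
import timeit


def parse_digits(text: str) -> int:
    """Hand-rolled ASCII-to-integer conversion for non-negative decimal strings."""
    value = 0
    for ch in text:
        digit = ord(ch) - 48  # 48 is ord("0")
        if not 0 <= digit <= 9:
            raise ValueError(f"not a decimal digit: {ch!r}")
        value = value * 10 + digit
    return value


def time_conversion(func, samples, repeat=3):
    """Best-of-`repeat` wall time in seconds for converting every sample once."""
    return min(timeit.repeat(lambda: [func(s) for s in samples],
                             number=1, repeat=repeat))


samples = [str(n) for n in range(0, 200_000, 7)]
builtin_seconds = time_conversion(int, samples)
custom_seconds = time_conversion(parse_digits, samples)
print(f"int(): {builtin_seconds:.4f}s  parse_digits(): {custom_seconds:.4f}s")
```

The same harness can wrap any function flagged as slow by the macro (per-module) profile.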
  • 9. Memory data reduction technology SNS, Internet …… DBMS, File Documents sensor Standard import file Standard Input file Bit divide Inverted File create 01010011 00110011 Byte encoder Column wise Bit grouping Re-position Code Temp encoder Formatter 0101 0011 x 3 acde001 defg002 fghi003 … → 1,2,3,… acde001 defg002 fghi003 … Output 0011 0011 1100 1100 …… - Memory (Data type simplifying + Byte sharing + slimmed data ST) - Disk I/O (usage frequency Grouping + data reduction) OP Analysis
  • 10. Operation Algorithm and inverted file. Ex(1) Wanted car surveillance (large-volume data operation algorithm; wanted-vehicle lookup, 250 million records/day). CCTV data: 0.25 bil images/day. Intentional changes: 1 ↔ 4, 3 ↔ 8, 마 ↔ 머, … Requires real-time search with minimum comparison. 3-digit misrecognition: 7C1 + 7C2 + 7C3 = 7 + 21 + 105 = 133, + right recognition 1 time = 134. General algorithm: 3000 images/s * 134 cases = 402000 tps. OzSearch algorithm: 3000 images/s * 0.2 s/image = 600 tps. Recognition failure not counted. Chart: tps, general algorithm vs. OzSearch. Ex(2) KR trade mark search system, about 5 million trademarks: 1. Word correction algorithm 2. Character comparison algorithm to find similar trade marks. Inverted file creation, current mechanism with data mining: 5 mil * (3^8 = 8 digits * 3 similar chars) = 3.2 bil indices; 3.2 bil * 0.00003 sec/case = 26.7 hr. OzMarker: 5 mil indices; 5 mil * 0.00003 sec/case = 150 s.
  • 11. Index size, Searching Time comparison (BMT results). Chart 1, inverted file size & capacity comparison, competitor vs. Becurio: Yes24 (index size, GB), Shopping mall 1 (index size, GB), sentence search (time, min), quote trading (throughput, relative value); plotted values 20, 50, 30, 11.7, 3.2, 5, 100. Chart 2, big data inverted file size comparison (stock trading example, 1 million accounts, 1 billion transactions), required space (GB): Row Wise 31, Invert File 50, OzSearch 4.5. Chart 3, index count comparison (2.3 million similar trademark index data): existing indexing method 3.2 billion, OzSearch algorithm 5 million. Chart 4, big data operation algorithm (wanted-vehicle lookup, 250 million records/day), tps: general algorithm 402000, OzSearch 600.
  • 12. Memory Reduction Example. In case of 1 billion trx from 1 mil accounts at 10 thousand branches; SSN, account #, name, mass trx check indexing (64-bit OS). 1) Row wise DB ST ≒ 31GB 2) Basic inverted data ST ≒ 50GB * as cases/columns increase, more storage space is required 3) Memory Reduction data ST ≒ 4.5GB * as cases/columns increase, efficiency also increases. Chart, index size (required space, GB): Row Wise 31, C/W index 50, Optimize 4.5. Row wise: master table (Name, SSN, ACC #): 1 mil * (20 bytes + 13 bytes + 20 bytes) = 53MB; trx table (ACC #, Designated Code, Mass trx y/n?): 1 bil * (20 bytes + 10 bytes + 1 byte) = 31GB. Basic inverted: master table (Name, SSN, ACC #): 1 mil * (20 bytes + 13 bytes + 20 bytes) = 53MB; inverted lists (Mass trx 0 → ACC #, …; Mass trx 1 → ACC #, …; ACC # with Designated Code): (20 bytes) * 1 bil = 20GB and 1 bil * (20 bytes + 10 bytes) = 30GB. Memory reduction: original data 53MB, TRX data ≒ 4.4GB.
  • 13. Memory Reduction: Data Structure and Capacity. Original data (53MB): Name + SSN, 1M rows, 33 bytes; Account #, 1M rows, 20 bytes (64-bit hash indexing); Designated #, 10,000 rows, 10 bytes. TRX: Mass TRX, 1B rows, 1 bit each, 125MB; TRX, 1B rows, 20 bits (1M ACC) + 14 bits (10,000 Designated #), 4.25GB. Memory reduction data ST and size: Mass TRX as 1B bits (0011010101000…… 110101010001101010); ACC # table, 1M entries of 20 bytes; Designated # table, 10,000 entries of 10 bytes; TRX data (10B) stored as abs ACC, 20 bits (1M), plus abs code, 14 bits (10,000). Analysis/OP format: 53MB, 125MB, 4.25GB.
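The sizes on slides 12 and 13 follow from simple arithmetic, sketched below. The slides give the field widths; the helper names (`bits_needed`, `pack`, `unpack`) and the choice to pack both indices into one integer are assumptions made for illustration, and sizes use decimal units (1 GB = 10^9 bytes) as the slides appear to.

```python
ACCOUNTS = 1_000_000              # 1M accounts
DESIGNATED_CODES = 10_000         # 10,000 "Designated #" values
TRANSACTIONS = 1_000_000_000      # 1B transactions


def bits_needed(distinct_values: int) -> int:
    """Smallest bit width that can number `distinct_values` items from 0."""
    return max(1, (distinct_values - 1).bit_length())


account_bits = bits_needed(ACCOUNTS)          # 20 bits cover 1M accounts
code_bits = bits_needed(DESIGNATED_CODES)     # 14 bits cover 10,000 codes

# Master data: name (20 bytes) + SSN (13 bytes) + account number (20 bytes).
master_bytes = ACCOUNTS * (20 + 13 + 20)
# Row-wise transactions: account # (20) + designated code (10) + flag (1).
row_wise_bytes = TRANSACTIONS * (20 + 10 + 1)
# Reduced form: one flag bit plus a packed (account index, code index) pair.
flag_bytes = TRANSACTIONS // 8
packed_bytes = TRANSACTIONS * (account_bits + code_bits) // 8


def pack(account_index: int, code_index: int) -> int:
    """Store both indices in a single 34-bit integer (assumed layout)."""
    return (account_index << code_bits) | code_index


def unpack(packed: int):
    """Inverse of pack(): recover (account_index, code_index)."""
    return packed >> code_bits, packed & ((1 << code_bits) - 1)


print(master_bytes / 1e6, "MB master data")
print(row_wise_bytes / 1e9, "GB row-wise transactions")
print(flag_bytes / 1e6, "MB mass-trx flags")
print(packed_bytes / 1e9, "GB packed transactions")
```

The reduction comes from replacing 20-byte and 10-byte field values with 20-bit and 14-bit positions in the master tables.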
  • 14. Memory Reduction Developed Improving Learning func. For analysis Data Type reduction Byte sep/share Super light inverted data Structure Standard user data definition API DISK I/O reduction Data Type reduction Column wise compression Server/Index distributed/parallel processing Essential algorithm for each part I/O Super light index DB structure Map reduce / comm. tech Memory Processing Distribution operation Parallel Processing I/O DISK I/O reduction Reduce Communication between servers Big data analysis (NoSQL type) More than hundred libraries Big Data in Memory Technology 50% resource saving search engine technology Developing Generalized Unstructured mass data processing Query Optimizer Core Technology 1) Search Engine indexing Structure and Algorithm 2) String Management related data structure and algorithm 3) Memory / resource efficiency enhancement library 4) Big Data in Memory related technology (based on Search Engine Technology) Status Quo of BeCurio Technology
  • 15. Product Explanation Search Engines (OzSearch) Keyword Search Basic Memory/DISK based resource sharing keyword search engine More than 50% index size reduction OzParser Integrated phoneme analyzer Just for search engine Instant Search Memory based OzKsana Real time keyword recommendation for each character input Real time customized ranking/indexing based on group Compare with Google instant search OzSniper AND search for each character, phoneme Analyzer Powerful spelling correction Text block Search OzDns Real time web based text block search Super fast and light location data index structure No preprocessing Algorithms Customized Search RERE Real time super fast ranking for each character input Based on keyword chain patent registration technology Typo Correction OzFix resource saving more than 100 times Optimizing accuracy and flexibility Compare with Google Super fast search Algorithm Auto completion and expansion for every keyword Dramatic speed improvement and resource reduction logic Solutions Big Data in Memory OzAiM Memory structure reduction data structure reduction data structure ANSI query (NoSql type) Analysis Real time Group by/Order by for more than 10B data Similar Trade mark Search OzMarker Based on OzFix algorithm Super fast indexing, enhanced accuracy and flexibility 6M trade mark indexing  24 hr : 100 sec Prior Patent Search Crony Avoiding search formula by experts Avoiding existing similarity analysis algorithm 1/10 resource + 10 times faster speed Plagiarism Checker OzSoS Plagiarism checker based on DnS Super fast, high accuracy real time text block similarity analysis BeCurio Products
  • 16. An Example of Patent Search Formula ( (web* or internet* or network*) and (brows*) and (HTML* or HTTP* or XML* or Markup* or javascript*) ) OR (((remote* and naviga*) or (spatial adj10 naviga*) or (arrow and (key adj10 naviga*)) or (directional and (key adj10 naviga*)) or (user adj10 interface and naviga*)) and (brows* or menu) ) OR ((((리모컨 or 리모콘) or (화살* and 키) or (방향 and 키) or (유저 adj10 인터페이스)) and 화면 and 선택) ) OR ( (((web) and (client* or browser*) and ((remote* adj10 control*) or (cursor adj10 navigation) or layout) or (web* or internet* or network*) and (brows* or navigat*) and ((remote* adj10 control*) or (user adj interfac*) or layout*))) ) OR ( (gui or presentation) and engine and (XML* or script* or Java*) ) OR ( (web and application and framework) or (web and application and platform) or (web and rich and internet and application) or (web and ria) or (web and ajax) or (web and asynchronous and javascript and xml) or (widget and web) or (gadget and web) or (rss and web) or (really adj3 simple adj3 syndication adj3 web) or (web and ((smart and client) or (smart and agent))) or (web and downloadable adj10 application) or XAML or XUL or MXML or (interface and element and web) ) OR ( (웹 and *플리케이션 and 프레임*) or (웹 and *플 리케이션 and 플랫폼) or RIA or AJAX or *이젝스 or *이잭스 or 아작스 or 위짓 or 위젯 or widget or 가젯 or 가짓 or gadget or RSS or (웹 and 맞춤형정보배달) or (웹 and 스마트 and 클라이언트) or (웹 and 스마트 and 에이전트) or (웹 and 스마트 and 에이젼트) or (웹 and 다 운* and *플리케이션) or XAML or XUL or MXML ) OR ( (CE or (TV or television) or DTV or (digital adj2 (TV or television))) and (service or (web adj10 service) or (mash adj up adj10 service)) ) OR ( (Opera or Yahoo or Konfabulator or Google or Microsoft or ANT or Mozilla or Netscape or MacroMedia or IBM or HP).AP. )
  • 17. Crony: New Patent Search Keyword search Document Search Technology Instant Search Text block Content & Location Prior Patent Search Text block location search
  • 18. CRONY Accurate Easy Economic Speedy Content and its Location No missing case No need for any formula Minimum Resource Fast Creation of Indexing DB Real Time DB updating Crony: Expected Values
  • 19. Crony System Crony vs. Keyword Search Legacy Patent Search Keyword Search Engine Search formula Keyword index DnS Text block Search Patent DB Keyword + location Auto block separation Patent DB text block search index DB Keyword index DB Keyword Search result Content + Location search result User within seconds by ordinary user Days or weeks Only by expert Preprocessing Sentence/Paragraph Hash Fn creation Similarity Analysis Finger Printing Cosθ Time delay Accuracy issue Sizable index DB Tweaked Sentence issue Similarity Analysis # of identical KWD, Density, sequence, etc No Pre-Processing
  • 20. Processing role speed Text block search filter Extracting the target for refined analysis Generic index DB search Within a second Extracting thousands cases Location data TB filter TB automatic separation Identical keywords, Distance analysis Within a seconds Extracting hundreds cases Refined analysis engine Text block analysis - Frequency, distance, sequence - Location calculation Within seconds Min. H/W Assumption: 10 page 3M 8GB MM, more than 2CPU, PC server Feature Min additional job(verification DB creation, etc) short term development and launch Millions of Patent data 1000 Refined analysis engine similarity result Minimize operation load Rough Text block operation Refined Text block Operation Text block operation filter DataQuantity Text block search Filter 1. Innovative structure by location based keyword analysis 2. Super fast and highly accurate similarity checking with 1/10 of resource Speed and Accuracy
  • 21. • Similar keyword detection for every keyword • Easy to use • User registration of similar keywords • Sg/Pl conversion • tense conversion • part of speech conversion Query Expansion 1. Detecting intentional search avoidance 2. Automatic Query expansion for similarity analysis Typo correction Root Keyword Search • Intentional type • correcting multiple words with no spacing • Powerful correction algorithm Other features • Detect change in word sequence • Detect spacing change • Detect partial change in words Sentence ST
  • 22. Text block Indexing 1. No pre/post processing 2. Not Using Finger Printing or other previous indexing method 3. Location data indexing for every keyword Detecting intentional avoidance Target data1 Prior patent detection 2 3 Instant Search Engine and others 4 Current Crony System Coverage 5 Unique functions and differentiated service 1. High speed text block search 2. Words sequence check 1. US Patent 1976~ : 3.8 million samples 2. KR Patent 1. Multiple words typo correction (better than Google) 2. Similar query expansion 3. Root word search 1. Customizable coverage and accuracy control 2. Instant search for meta data 3. Saving function for the content of interest
  • 23. Text block chained search By right mouse click Chained Search of Similar Patents
  • 24. Crony: unique text block search Real time web based No preprocessing Crony vs. eTBlast by Virginia Tech OzDnS : eTBlast
  • 25. Variable Controls (X. Defining user-customized search conditions). Variable: role → effect. Disparity: distance b/w keywords → accuracy and number of search results. Min text block size: minimum # of words in text block → control # of search results. Keyword weight: frequency and boundary → ranking adjustment. Max text block size: maximum # of words in text → visual representation. Word gap: text block separation → accuracy, # of search results. Word order: keyword order accordance → accuracy.
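How controls such as word gap, min/max text block size, and word order might shape a text block search can be sketched as follows. This is a guess at the general idea and not Crony's algorithm: the function `find_text_blocks`, its parameter names, and its defaults are hypothetical, and keyword weight and disparity are left out.

```python
def _is_subsequence(needle, haystack):
    """True if the words of `needle` appear in `haystack` in the same order."""
    it = iter(haystack)
    return all(word in it for word in needle)


def find_text_blocks(doc_words, query_words, word_gap=5,
                     min_block_size=2, max_block_size=50, require_order=False):
    """Group query-keyword hits into text blocks.

    Returns (start, end, distinct_keywords) tuples, where start and end are
    word positions in doc_words. Hits more than `word_gap` words apart start
    a new block; blocks outside the size limits are dropped.
    """
    query = set(query_words)
    hits = [i for i, word in enumerate(doc_words) if word in query]

    blocks, current = [], []
    for pos in hits:
        if current and pos - current[-1] > word_gap:
            blocks.append(current)  # gap too wide: close the block
            current = []
        current.append(pos)
    if current:
        blocks.append(current)

    results = []
    for block in blocks:
        start, end = block[0], block[-1]
        size = end - start + 1
        if not min_block_size <= size <= max_block_size:
            continue
        matched = [doc_words[i] for i in block]
        if require_order and not _is_subsequence(query_words, matched):
            continue
        results.append((start, end, len(set(matched))))
    return results


text = ("the shared index saves memory while the hash index "
        "keeps search fast and memory use low").split()
print(find_text_blocks(text, ["index", "memory"], word_gap=3))
print(find_text_blocks(text, ["index", "memory"], word_gap=5))
```

Tightening `word_gap` splits the hits into more, smaller blocks, while loosening it merges them, which matches the slide's note that word gap affects both accuracy and the number of results.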
  • 26. Applicable areas with Crony Search • Patent Search/ Judgment Search/ Plagiarism Checker/ Quotation Search/ eBook Content Search, etc • Smart Contact Center • automated text message feedback for the known questions • Script Search • Jumping to the video frame matching to the script line • Removing repeated questions • By instantly showing the similar questions to the character input • Data Mining Search • Data Mining with Search Interface (Anyone can do mining) • Hyper-Knowledge Product: Sharing Knowledge with No effort
  • 27.  KM system, EDMS, Document, etc Innovative system  Personal search pattern and storing  Increase core knowledge sharing  Innovative knowledge reference / sharing model  Super fast instant search  Real time web based text block search Creating value added knowledge network by quality knowledge acquisition and sharing Creating Knowledge eco-system Hyper Knowledge Creation Model i. Knowledge branch creation model ii. core contents chain  core content sharing iii. Knowledge eco-system by specialized category
  • 28. Hyper Knowledge Current Document/Knowledge management system • Document life cycle management • Document search by keyword • Knowledge registration oriented • Low Document utilization (Too many results) Knowledge branching and sharing system • Increasing knowledge utilization by increasing knowledge sharing • Search and share core knowledge text block • voluntary knowledge sharing • Knowledge search based on text block similarity search KMS EDMS CMS ERP… etc Life Cycle mngt. Keyword search Document based Not able to identify the content of interest automatically Search knowledge and its location Knowledge branch creation Sharing knowledge by specialized categories Individual saving text block of interest Creating high quality knowledge Legacy systems Document mngt. sys Hyper Knowledge System Individual Knowledge branch Shared Knowledge network
  • 29. Process of Knowledge Network Creation Shared node report(1) Key Paragraph-1 + wikipedia/original link Key Paragraph-21 + PCM/original link Key Paragraph-31 + report/link Input Keyword : knowledge network Date : 2013. 2. 15 Reference Docs : wikipedia / 15 paragraph results attachment wikipedia PCM… DnS TBS Wikipedia KOI TBS PCM TBS DnS Instant search report TBS DnS “knowledge network” Key Paragraph-51 + shared node Link Knowledge i. Individual knowledge search and management ii. Personal core knowledge chain creation → iii. Automatic Knowledge network creation by specialized categories Enhancing Knowledge creation 10 core paragraphs 10 core paragraphs 10 core paragraphs Infinite Knowledge Network (10 documents 10 core knowledge) Theoretical knowledge combination 10 * 10 * …… * 10 = 10^10 ≒ ∞ … … … Knowledge sharing Save Key Paragraph-21 Key Paragraph-1 Key Paragraph-31 Common Interest
  • 30. Core Knowledge Sharing Map. Source document: Title: Knowledge block? Author: BeCurio Research Center, with Key Paragraph-1, Key Paragraph-2 and Key Paragraph-3 marked in the body text. Shared-node card (the same card is shown for each prior and post document; KOI drag from report(1), …(2), …(3)): Key Paragraph-1 + wikipedia/original link; Key Paragraph-21 + PCM/original link; Key Paragraph-31 + report/link; Input Keyword: Samsung; Date: 2013. 2. 15; Reference Docs: wikipedia / 15 paragraph; query + source + key sentences + shared node link + … Sharing node table: Kwd/TB search → doc name, text block location, ref. freq (line thickness) → post doc name, original link → sharing node table creation → sharing node link → shared map creation. Core Knowledge Sharing: i. Extracting core knowledge from huge documents ii. Sharing by core text block similarity iii. Creating knowledge links by specialized categories
  • 31. K box Growing Knowledge Eco-System Individual chain Hyper Knowledge Creation  Growing knowledge eco-system Knowledge network creation process Individual search  knowledge saving  share node creation  knowledge network creation  growing knowledge eco-system Search technology Instant search  TBS for text body  Chain TBS for knowledge effect dramatically improving knowledge sharing, increasing core knowledge acquisition opportunity and saving time  high level knowledge creation Category Key Paragraph-1 Key Paragraph-21 share node KP#12 KP#13 KP#22 KP#23 KP#31 KP#32 KP#311 KP#312 KP#n1 KP#n2 report doc www K. network . . . . . . . . . . . . expansion … … … paper
  • 32. Instant Search with 3 no’s: No enter key, No Waiting, No Zero Result. Current Search Engine: Keyword + Waiting, Search Result or No Result, Navigation or re-search. Dramatic Improvement of Speed, Quality and Ease of Use. Keyword input: Search result (keyword), or on typo/no result, Instant Correction for No Result. New: Search Result provided by each input key stroke
  • 33. Purchase Department Instant Search Stock Management Category for individual purchase person (IP / log in ID) Raw Material information for MEMORY Current Stock information Basic product info., related company info.  Instant Search Text block Drag & search Super-fast text block content and location search Based on dramatically improved Search speed, a new algorithm for text block search applied Provide super-fast and customized search for each character input to more than 10,000 departments in S group Supplier (domestic/foreign) In-stock or supply information Character Input “M E M O R Y” Customized Instant Search by Department z Instant Search Recommend relevant item for each character input Attachment, manual, product detail, content, etc Text block Search z Text block chain search
  • 34. (non)login  Search  Cart  Purchase Recommending Target books ManagingUser/Group Behavior Pattern Personal Pattern Real time recommendation Recommend Filter Group Pattern Steve Jobs Search - Goods Attributes - MD managing points - Target DB Book Recommendation Service RERE category author Recommendation Accuracy (including MD ) event Search Pattern Recommend Filter Recommend Filter Real time Behavior log CRM/log DB Product property Utilizing input keyword, click data and purchase history 10. 6. am 07: 00 Data propagation  presentation… book purchase / iphone4S order Real time pattern am 07: 10 Category/author/event score curve am 08: 30 Real time recommendation MD manual recommendation Learning
  • 35. Similar Trademark Search: OzMarker. Example: KR trademark search system, about 5 million trademarks. Invert file creation time: current mechanism with data mining, 5 mil * (3^8 = 8 digit * 3 similar char) = 3.2 bil indices, 3.2 bil * 0.00003 sec/case = 26.7 hr; OzMarker, 5 mil indices, 5 mil * 0.00003 sec/case = 150 s. Comparison (Current vs. OzMarker): Processing mechanism: similar character indexing vs. typo correction algorithm (better than Google typo correction algorithm). Indexing time and size (KR trademark, 5 million data): about 24 hr, 10 GB vs. about 100 sec, 500 MB. Accuracy: depends on similarity definition vs. similar character, similarly pronounced words algorithm, independent of languages. Ease of use: delay of new trademark registration vs. registration of new trademark within a few seconds. Expandability: applying a new pattern requires overall indexing vs. no overall indexing, independent of languages.
  • 36. Big Data in Memory solution, AiM. AiM = search engine part + massive data analysis part. Search engine part: super light search engine tech (high level memory utilization, efficient memory use) in place of existing search engine tech. Massive data analysis part: 1. Basic Fn (ANSI query: + = - > <); 2. User defined Fn (add new pattern); 3. Statistics Fn; 4. Other data analysis Fn; data analysis tool. General solution: meet user requirements such as data analysis speed, analysis tool and statistical methods. Big data structure and algorithm: fast and convenient.
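
The invert file creation times quoted on slide 35 can be checked with a few lines of arithmetic. The sketch below is only an illustration: the constant and function names are hypothetical, and the per-index cost (0.00003 sec/case) and index counts are taken from the slide as given. If the per-trademark expansion is read as 3^8 (three similar characters at each of eight positions), 5 million trademarks would give roughly 32.8 billion indices rather than the 3.2 billion stated, so the slide's stated count is used for the timing.

```python
# Figures quoted on slide 35; all names here are hypothetical.
SEC_PER_INDEX = 0.00003            # slide's estimated cost per index entry
TRADEMARKS = 5_000_000             # about 5 million KR trademarks
CURRENT_INDICES = 3_200_000_000    # index count stated for the current mechanism
OZMARKER_INDICES = TRADEMARKS      # one index entry per trademark


def build_seconds(index_count: int, sec_per_index: float = SEC_PER_INDEX) -> float:
    """Estimated invert file creation time, in seconds."""
    return index_count * sec_per_index


current_hours = build_seconds(CURRENT_INDICES) / 3600
ozmarker_seconds = build_seconds(OZMARKER_INDICES)

# Expansion if each of 8 positions has 3 similar-character variants (assumption).
expansion_per_trademark = 3 ** 8
expanded_indices = TRADEMARKS * expansion_per_trademark

print(f"current mechanism: {current_hours:.1f} hr")
print(f"OzMarker:          {ozmarker_seconds:.0f} s")
print(f"5 mil * 3^8:       {expanded_indices:,} indices")
```

The two timing figures on the slide (26.7 hr and 150 s) follow directly from the stated index counts and per-index cost.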

Editor's Notes

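  1. Example of memory utilization. Is this possible with open source? = It goes against generalization. Even if it were possible, what problems would there be?
  2. Algorithm aspect: found to be impossible on the current system. Importance of the patent office's similar trademark lookup algorithm (based on a differentiated structure). Would this kind of work really be possible with open source?
  3. Real-world case. Justification for using the core technology. Endless questions about open source?
  4. An example actually used in our engine. Importance of memory reduction. Establishing the conditions for memory-based Big Data processing.
  5. Diagram of the detailed component technologies to be developed. Emphasize memory: big data in memory is needed, but it disappears. What if the memory-based processing capacity per unit server is at or above the per-server processing capacity level? Target data.
  6. Real-world case. Justification for using the core technology. Endless questions about open source?
  7. What is different? We do not use a ready-made engine. We propose this project with an infrastructure that amounts to looking into the blueprints and internal structure → differentiation. What is there to do? The same as for anyone: structured/unstructured data.
  8. Comparison of personalized search and general search. General search → operates an index DB that is batch-ordered as needed. Personalized search → real-time ranking, feedback, and search history management; requires the practical component technologies 1 to 4. Among the various approaches to personalized search, the focus is on this project's approach → use of lightweight search engine technology.
  9. What is different? We do not use a ready-made engine. We propose this project with an infrastructure that amounts to looking into the blueprints and internal structure → differentiation. What is there to do? The same as for anyone: structured/unstructured data.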