The talk presents a new technique for real-time, single-entity information extraction and search. The technique eliminates the regular refresh and persistence of data within the search engine (the ETL cycle), providing real-time access to source data and improving response times through in-memory data techniques. The solution presented is a concrete one, based on real business needs and running with live customers. I will cover the architectural overview, the technology stack built on the Apache Lucene library, the results achieved, and how to scale the solution out.
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - Codemotion Milan 2016
1. Search on the fly:
how to lighten your Big Data
Simona Russo
Auro Rolle
MILAN 25-26 NOVEMBER 2016
2. Who am I?
Simona Russo
Head of R&D at FacilityLive (4 years); before that, 20+ years in major Italian software houses.
... some numbers: 4 software products developed
• Finance (3 years)
• Payroll (3 years)
• Logistics (10+ years)
• Search (4 years)
… so I'm 20+ years old
4. AGENDA
• Use cases
• Business needs
• Analysis
• Architectural overview
• Some components
• RAM indexes
• Cache management
• Push data from server to client
• Optimize HTTP data serialization
• Scale out
• Stress test and performance test
• Future improvements
5. Prerequisite: Lucene
Lucene is the de facto standard search library.
Initially developed by Doug Cutting in 1999 (also the author of Hadoop, 2003), it joined the Apache Software Foundation in 2001.
See https://en.wikipedia.org/wiki/Apache_Lucene
Apache Lucene is an open-source, full-featured search engine library written entirely in Java:
• many powerful query types: phrase queries, wildcard queries, range queries
• fielded searching (e.g. title, author, contents)
• synonym and stopword options
• … and more. See http://lucene.apache.org/ for further information
6. Use cases: Business needs
“I want
constantly
updated data”
“I want excellent
search response time”
“I want to search all the available
data I have (even from different
data sources)”
“I don’t want to
duplicate my data”
8. Use cases: Conceptual high-level layer overview
[Diagram: external data (services, DBs, files such as CSV and XLS) feeds the Search Platform's data layer of Lucene indexes, exposed as logical views (e.g. Customers view, Invoices view, Products view) to the presentation layer that shows the search results. Caption: "How to follow business needs".]
9. Use cases: Business needs
"I want to search all the available data I have (even from different data sources)"
Analysis: how to integrate different data sources and search on them?
Simple! We can create one or many Lucene indexes integrating different data sources (thanks to the schema-less feature):
• joining data sources
• transforming data values
• enriching, cleansing and indexing data values
Can we use data virtualization/federation middleware from other vendors?
• Yes, we can, but we would have a double data integration:
1. from the data sources to the data virtualization middleware
2. from the data virtualization middleware to the Lucene indexes.
This may add latency, consistency, flexibility and revenue issues in a search-driven platform.
So we decided not to use it and to integrate data into the Lucene indexes directly.
10. Use cases: Business needs
"I don't want to duplicate my data"
Analysis: how to avoid duplicating the data sources?
With Lucene we index only the metadata, i.e. the data the user wants to search on.
All other data (e.g. PDFs, HTML pages, …) can reside on the source system and be accessed on demand.
This significantly reduces data duplication.
11. Use cases: Business needs
Analysis: how to get constantly updated data?
To serve constantly updated data we would have to constantly call the external services that provide it!
Issues:
• it overloads the external data services
• it worsens search times (before every search: call the services, create the indexes, then search!)
• it is not feasible if we have to index ALL the business data (Big Data) at search time or at very short intervals
12. Use cases: Business needs
Analysis: how do I get constantly updated data?
We need to define the scope within which the business needs can be satisfied; we need to define a path.
13. Use cases: Constraints
Consider the example of a Call Center use case where the Operator has a
set of keywords to identify the caller (for example: phone number, fiscal
code, customer id, …) and needs to find all the related business
information.
The data is retrieved and indexed once the customer is identified and then
the operator refines the search to find specific customer information
related to the call.
If the operator does not interact with the system for a predefined time
interval (for example 1 minute) the indexed data can be discarded and the
next operator interaction will produce updated data.
14. Use cases: Constraints
The use case applies when the data has a single-entity nature (for example, the customer of a call center) and is found using a known, specific key.
The technique is described as on-the-fly because the data is retrieved and indexed in real time.
The application then uses the indexed data to let the user find the specific information required for the business transaction without re-contacting the source systems.
“I want
constantly
updated data”
“I don’t want to
duplicate my data”
15. Use cases: Business needs
"I want excellent search response time"
Analysis: how to get excellent response times?
With the previous use-case constraints we index only the data correlated to the single entity searched:
• So we can create in-memory (RAM) Lucene indexes, because we have a smaller data set. RAM indexes perform better than on-disk indexes, but they must be evicted after a predefined time (a cache with an eviction policy).
How to prevent one slow external data service from delaying the search response time?
• Push the data from each service to the browser as soon as it is available (SSE).
16. Architectural overview: Components
• Lucene RAM Indexes
• Improve response time
• Reduce data duplications
• Cache
• Improve response time of refined search
• Reduce data duplications (eviction policy)
• Push data as soon as it is available
• Improve total response time
17. Lucene RAM Index
A memory-resident implementation of a Lucene index.
WARNING!
The Lucene RAMDirectory implementation is not intended to work with huge indexes: everything beyond several hundred megabytes will waste resources (GC cycles), because it uses an internal buffer size of 1024 bytes, producing millions of byte[1024] arrays. This class is optimized for small memory-resident indexes. It also has bad concurrency on multithreaded environments.
See https://lucene.apache.org/core/6_3_0/core/org/apache/lucene/store/RAMDirectory.html
• This fits our case: a limited environment in terms of data (single entity) and concurrent users.
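As a sketch of the technique, this is roughly how a small memory-resident index is built and searched, assuming lucene-core and lucene-queryparser (6.x API) on the classpath; field names and values are illustrative, not the actual FacilityLive code:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;

public class RamIndexSketch {

    // Index a single "customer" document in memory and count the hits for a keyword.
    static int indexAndCount(String fieldValue, String keyword) throws Exception {
        RAMDirectory dir = new RAMDirectory();          // memory-resident index
        StandardAnalyzer analyzer = new StandardAnalyzer();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("customer", fieldValue, Field.Store.YES));
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            return searcher.search(
                    new QueryParser("customer", analyzer).parse(keyword), 10).scoreDocs.length;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(indexAndCount("Mario Rossi +390123456789", "rossi"));
    }
}
```

The whole index lives and dies with the user's search context; nothing is ever flushed to disk.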
18. Lucene RAM Indexes
How do we use RAM indexes?
A RAM index is identified by:
• UserId
• Name (the logical name of the data set); it depends on the user customization, e.g. customers, products, invoices
• Context Id (an alphanumeric value); a new context id is assigned when the search returns empty results.
So the first search gets a new context id, and the refined searches reuse the same context id.
The same search from a different user leads to the creation of new indexes.
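The identification scheme above can be modeled as a composite key for the index cache; a minimal sketch with hypothetical class and field names (not the actual FacilityLive code):

```java
import java.util.Objects;

// Hypothetical composite key: (user, logical data set name, search context)
public class IndexKey {
    final String userId, name, contextId;

    IndexKey(String userId, String name, String contextId) {
        this.userId = userId;
        this.name = name;
        this.contextId = contextId;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof IndexKey)) return false;
        IndexKey k = (IndexKey) o;
        return userId.equals(k.userId) && name.equals(k.name) && contextId.equals(k.contextId);
    }

    @Override public int hashCode() { return Objects.hash(userId, name, contextId); }

    public static void main(String[] args) {
        // Same user, data set and context id -> same RAM index
        IndexKey a = new IndexKey("user1", "customers", "ctx42");
        IndexKey b = new IndexKey("user1", "customers", "ctx42");
        System.out.println(a.equals(b)); // prints true
    }
}
```

A different user, or a search that starts a new context, produces a different key and therefore a new set of indexes.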
19. Lucene RAM Index
In a cluster environment the RAM indexes related to a user's searches always reside on a dedicated node (load balancer with stickiness by user).
If that node fails, the refined search is redirected to another node and the RAM index is created again there.
[Diagram: a load balancer routes user A to node 1, which holds the RAM indexes (A), and user B to node 2, which holds the RAM indexes (B).]
20. Why a cache?
Because we use RAM indexes, and we want to access them again when the user refines the previous search.
Why not a ConcurrentHashMap?
Don't reinvent the wheel! With a cache we get a lot of built-in functions for free, among them eviction policies.
Eviction policies:
Every cache needs to remove values using different criteria: by time, size or weight.
By time (expiration), remove elements that have reached:
• their idle time, the span of time during which no operation is performed with the given key (get/put);
• their total lived time, the maximum span of time an element can spend in the cache.
21. Caches benchmark
JMH microbenchmark: put/get (4 threads + 4 threads)
1000 alphanumeric keys per iteration pre-inserted (plus 1000 via put operations)
-i 10 -wi 2 -r 2s -f 5
[Chart: put and get throughput (millions of ops/sec) for Cache2k, Guava, EhCache and Infinispan; Cache2k is the fastest, followed by Guava.]
MacBook Pro, Intel Core i7 2.5 GHz, 4 cores (2 threads per core), 16 GB RAM
• Cache2k 0.28-BETA
• Guava 20.0 (conc. 100)
• Infinispan 8.2.4.Final (conc. 100)
• EhCache 3.1.3
We tested and selected the libraries in 2015 (JDK 1.7); these benchmarks were re-run with the latest library versions (JDK 1.8).
Cache2k status: "… We use every cache2k release within production environments. However, some of the basic features are still evolving and there may be API breaking changes until we reach version 1.0." (https://cache2k.org/)
22. Cache: Guava
We selected Guava for the best overall performance (after Cache2k) and because we needed a simple local cache (not distributed) compatible with JDK 1.7.
The following is an example of how to create a cache with an idle-time expiration policy:
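The code shown on the original slide is not in the transcript; a minimal sketch of what it likely looked like, assuming Guava on the classpath (key format and timeout are illustrative):

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

public class IndexCache {

    // Entries are (logically) evicted once they have not been read
    // or written for `idleSeconds` seconds.
    static Cache<String, Object> build(long idleSeconds) {
        return CacheBuilder.newBuilder()
                .expireAfterAccess(idleSeconds, TimeUnit.SECONDS)
                .build();
    }

    public static void main(String[] args) {
        Cache<String, Object> ramIndexes = build(60);
        // Key combines user, data set and context id (illustrative format)
        ramIndexes.put("user1/customers/ctx42", new Object());
        System.out.println(ramIndexes.size()); // prints 1
    }
}
```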
We use the idle-time expiration policy to remove expired indexes from the cache.
23. Cache: Guava
Expired elements
• Logical removal:
when an element expires it is not automatically removed from the cache; expired entries are never visible to read or write operations, but Cache.size() still counts them.
• Physical removal happens:
• when a write/read touches the expired entry
• when Cache.cleanUp() is called (forced eviction)
cleanUp() is an expensive operation, so Guava does not perform it automatically; the developer has to trigger it.
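One way to trigger that cleanup is a small maintenance thread; a sketch (not the actual FacilityLive code) assuming Guava on the classpath, with an illustrative period:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CacheMaintenance {

    // Periodically force physical removal of logically expired entries.
    static ScheduledExecutorService scheduleCleanUp(Cache<?, ?> cache, long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(cache::cleanUp, periodSeconds, periodSeconds,
                TimeUnit.SECONDS);
        return scheduler;
    }

    public static void main(String[] args) {
        Cache<String, Object> cache = CacheBuilder.newBuilder()
                .expireAfterAccess(60, TimeUnit.SECONDS)
                .build();
        ScheduledExecutorService scheduler = scheduleCleanUp(cache, 15);
        // ... cache in use; expired RAM indexes are physically freed every 15 s ...
        scheduler.shutdown();
    }
}
```

Without such a trigger, an expired RAM index would keep occupying heap until some other operation happened to touch its entry.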
24. How to push data to the browser?
Push data from server to browser
25. Polling
The client repeatedly sends new requests to the server. If the server has the response data it sends the response, otherwise it sends an empty response.
Push data from server to client
[Diagram: browser sends request after request; the server answers each one with data or an empty response.]
26. Long polling
The client sends a request to the server: if the server has no data it holds the connection and waits until data is available, then sends the data back to the client.
[Diagram: browser sends a request; the server responds only once it has data; the client then issues the next request.]
27. Server-Sent Events
Similar to long polling: the client sends a request to a server that waits until data is available and then sends the data back as one or more events. The client processes the data without closing the connection, until the server sends the last event.
[Diagram: browser sends a single request; the server pushes multiple events with data over the same connection, then a final event closes it.]
28. Server-Sent Events (SSE)
• The server transmits data to the browser as a continuous stream with the event-stream content type, over a connection that is left open.
• Transported over HTTP, it can be polyfilled with JavaScript to backport SSE to browsers that do not support it yet.
• Java EE 8, JSR 370 (JAX-RS 2.1) proposed specification:
"SSE is a new technology defined as part of the HTML5 set of recommendations for a client (e.g., a browser) to automatically get updates from a server via HTTP. It is commonly employed for one-way streaming data transmissions in which a server updates a client periodically or every time an event takes place."
• Java library to manage client/server SSE connections: https://jersey.java.net/documentation/latest/sse.html
• WebSocket?
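The wire format behind the event-stream content type is plain text; a minimal sketch of how a single event frame is composed (event name and payload are illustrative):

```java
public class SseFrame {

    // Build one SSE frame: an optional "event:" line, one "data:" line per
    // payload line, terminated by an empty line that marks the end of the event.
    static String frame(String event, String data) {
        StringBuilder sb = new StringBuilder();
        if (event != null) {
            sb.append("event: ").append(event).append("\n");
        }
        for (String line : data.split("\n")) {
            sb.append("data: ").append(line).append("\n");
        }
        sb.append("\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // One event per external service response
        System.out.print(frame("result", "{\"service\":\"v1\"}"));
    }
}
```

In practice a library such as Jersey builds these frames for you; the sketch only shows why pushing one event per service response is cheap: each frame is a few text lines on an already-open connection.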
29. Optimize HTTP data serialization
We use Kryo, a Java framework for fast and efficient object serialization, to reduce the size of the objects being serialized over an HTTP connection.
We reduced HTTP service response times by up to 50%!
https://github.com/EsotericSoftware/kryo
Benchmarks: https://github.com/eishay/jvm-serializers/wiki
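A minimal round-trip sketch of the Kryo usage described above, assuming the kryo artifact on the classpath; the payload string is illustrative (the real code serializes the service response objects):

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

public class KryoRoundTrip {

    // Serialize a value into a compact byte[] suitable for an HTTP response body.
    static byte[] serialize(Kryo kryo, String value) {
        Output output = new Output(64, -1);   // 64-byte initial buffer, growable
        kryo.writeObject(output, value);
        output.close();
        return output.toBytes();
    }

    public static void main(String[] args) {
        Kryo kryo = new Kryo();
        byte[] bytes = serialize(kryo, "search result payload");
        // Deserialize on the receiving side
        String restored = kryo.readObject(new Input(bytes), String.class);
        System.out.println(restored.equals("search result payload")); // prints true
    }
}
```

The size reduction versus default Java serialization comes from Kryo's compact binary encoding (no full class descriptors per object), which is what shrinks the HTTP payloads.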
30. Architectural high-level overview
[Diagram: (1) the user's search by keyword, or a refined search, reaches the Search Manager from the presentation layer; (2) the Search Manager searches the "context" RAM indexes held in the Guava cache; (3) if there are no results, it accesses the external data services by keyword, asynchronously; (4) one HTTP REST call per service goes out through the Service Mappers; (5) each Service Mapper creates RAM indexes from its service response; (6) one response per service is pushed back to the presentation layer as an SSE event, serialized with Kryo.]
31. Scale out
[Diagram: a load balancer with stickiness by user distributes presentation-layer requests across a cluster of search-layer nodes; each node hosts its own Search Manager, Guava cache of RAM indexes and Service Mappers.]
34. Test environment: 32 GB RAM, 8 cores, 20 GB heap
Phase 1
Incremental tests to find the maximum transaction frequency:
• 400 Vusers
• 800 Vusers
Phase 2
Environment minimization and direct load on the search engine
37. Phase 2
Direct load injection on a single back end via API.
Search and indexing transactions with the same document weight used in phase 1:
• 19 binary "add document" operations on 19 Lucene indexes, average size 20 KB per add
• 1 search with 24 results
39. Phase 2
Test: 200 Vusers, 43 tr/s, cleanup every 15 s
TOTAL: 1,107,087 samples; average response time 66 ms; 0.00% errors; 860.0 transactions per second
40. Future improvements
• We are migrating to a microservices architecture:
• for example, the Service Mapper and Search Manager can easily be converted to microservices and scaled out independently
• We are testing the benefits/drawbacks of migrating to HTTP/2:
• header compression
• server push
• binary protocol
41. Search on the fly:
how to lighten your Big Data
MILAN 25-26 NOVEMBER 2016