Real time analytics using Hadoop and Elasticsearch
by ABHISHEK ANDHAVARAPU
  1. Real time analytics using Hadoop and Elasticsearch, by ABHISHEK ANDHAVARAPU
  2. Thank you Sponsors!
  3. About Me
     • Currently working as Software Engineer (Data Platform) at Allegiance Software Inc.
     • Passion for distributed systems and data visualization.
     • Masters in Distributed Systems.
     • abhishek376.wordpress.com
  4. Agenda
     • Use case
     • Architecture
     • Elasticsearch 101
     • Demo
     • Lessons learnt
  5. Legacy Architecture
  6. Current Architecture
  7. Why Hadoop?
  8. Elasticsearch 101
     • Document-oriented, JSON-based search engine with Apache Lucene under the covers.
     • Schema free.
     • It's distributed and supports aggregations, similar to GROUP BY.
     • Uses bit sets to cache efficiently.
     • It's fast. Super fast.
     • It has REST and Java based APIs.
  9. Elasticsearch CRUD
     Index a person:
       curl -XPUT 'localhost:9200/person/1' -d '{ "first_name" : "Abhishek", "last_name" : "Andhavarapu" }'
     Get a person:
       curl -XGET 'localhost:9200/person/1'
     Delete a person:
       curl -XDELETE 'localhost:9200/person/1'
     Update a person:
       curl -XPOST 'localhost:9200/person/1/_update' -d '{ "doc" : { "first_name" : "Abhi" } }'
  10. Elasticsearch data (diagram): two nodes, Node1 holding shard S0 and Node2 holding shard S1.
  11. Replicas (diagram): Node1 and Node2 each hold a copy of S0 and S1. Red = primary, blue = replica.
  12. More nodes (diagram): Node1, Node2, Node3 and Node4, with the copies of S0 and S1 spread across them. Red = primary, blue = replica.
  13. Node down (diagram): the same four-node layout, with one node failing.
  14. Node down, continued (diagram): Node1, Node3 and Node4 remain. A replica on a surviving node is promoted to primary, and the lost copy is re-replicated on another node. Red = primary, blue = replica.
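The failover in slides 12 to 14 can be sketched as a toy model. This is only an illustration, not Elasticsearch's actual allocation logic: the `fail_node` helper is hypothetical, and the primary/replica roles in the starting layout are assumed, because the slide colors did not survive in this transcript.

```python
def fail_node(cluster, dead):
    """Return the layout after node `dead` is lost.

    `cluster` maps node name -> {shard name: "primary" or "replica"}.
    Hypothetical helper for illustration only.
    """
    # Copy the surviving nodes so the input layout is left untouched.
    survivors = {n: dict(s) for n, s in cluster.items() if n != dead}
    for shard, role in cluster[dead].items():
        holders = [n for n in sorted(survivors) if shard in survivors[n]]
        if role == "primary" and holders:
            # A surviving replica is promoted to primary.
            survivors[holders[0]][shard] = "primary"
        # Re-replicate the lost copy on the least loaded node without one.
        spare = [n for n in sorted(survivors) if shard not in survivors[n]]
        if spare:
            target = min(spare, key=lambda n: len(survivors[n]))
            survivors[target][shard] = "replica"
    return survivors


# Assumed starting layout (roles are a guess, see above).
cluster = {
    "Node1": {"S0": "primary"},
    "Node2": {"S1": "primary"},
    "Node3": {"S1": "replica"},
    "Node4": {"S0": "replica"},
}
after = fail_node(cluster, "Node2")
for node, shards in sorted(after.items()):
    print(node, shards)
```

With this assumed layout, losing Node2 promotes the S1 replica on Node3 and places a new S1 replica on one of the remaining nodes.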
  15. Elasticsearch 101
      • Lucene is under the covers.
      • Each index (like a database) is made up of multiple shards; each shard is a Lucene instance.
      • Shards are distributed amongst all nodes in the cluster.
      • When a node fails or new nodes are added, shards are automatically moved from one node to another.
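How a document ends up on a particular shard can be sketched as a hash of its routing value (the document ID by default) modulo the number of primary shards. The sketch below assumes `zlib.crc32` as a stand-in hash; Elasticsearch uses its own hash function, so the actual shard numbers will differ. The `shard_for` name is hypothetical.

```python
import zlib


def shard_for(doc_id: str, num_primary_shards: int) -> int:
    """Pick a shard for a document ID (illustrative stand-in hash)."""
    # crc32 is an assumption for this sketch, not the hash Elasticsearch uses.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


# Two primary shards, S0 and S1, as in the slides.
placement = {doc_id: shard_for(doc_id, 2) for doc_id in ["1", "2", "3", "4"]}
for doc_id, shard in placement.items():
    print(f"document {doc_id} -> S{shard}")
```

Because the shard is derived from the number of primary shards, the same ID always maps to the same shard as long as that number does not change.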
  16. How is it fast? Distributed execution (diagram): a client sends a query to the cluster; Node 1 and Node 2 each hold copies of S0 and S1. Red = primary, blue = replica.
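Distributed execution can be sketched as scatter/gather: every shard computes its own top hits and a coordinating step merges them. This is a simplified, sequential illustration under assumptions; the scoring (a plain term count), the sample documents and the function names are all made up for the example, and a real cluster queries the shards in parallel.

```python
import heapq


def search_shard(shard_docs, term, k):
    """Return the shard-local top k hits as (score, doc id) pairs."""
    hits = [(doc["text"].count(term), doc["id"]) for doc in shard_docs]
    hits = [hit for hit in hits if hit[0] > 0]
    return heapq.nlargest(k, hits)


def search(shards, term, k):
    """Scatter the query to every shard, then gather and merge the results."""
    partial = [search_shard(docs, term, k) for docs in shards.values()]
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    return [doc_id for _score, doc_id in merged]


# Made-up sample data, one list of documents per shard.
shards = {
    "S0": [{"id": "a", "text": "hadoop hadoop hive"}, {"id": "b", "text": "hive"}],
    "S1": [{"id": "c", "text": "hadoop elasticsearch"},
           {"id": "d", "text": "hadoop hadoop hadoop"}],
}
top = search(shards, "hadoop", 2)
print(top)  # ['d', 'a']
```

Each shard only has to rank its own documents, which is why adding shards spreads the work of a single query.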
  17. DEMO
      • Import data from the SQL database into Hive. (Extract)
      • Run the necessary computations using Hadoop/Hive. (Transform)
      • Push the data into Elasticsearch. (Load)
      • Run queries against Elasticsearch.
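The load step could look roughly like the sketch below, which builds a request body in the newline-delimited JSON format of the Elasticsearch bulk API (one action line followed by one source line per document). The slides do not say how the data was actually pushed, so this is an assumption; the `bulk_body` helper and the sample rows are hypothetical, and older Elasticsearch versions also expected a `_type` in the action line.

```python
import json


def bulk_body(index, rows, id_field="id"):
    """Build a bulk request body: an action line, then the document itself."""
    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": index, "_id": row[id_field]}}))
        lines.append(json.dumps(row))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Made-up rows standing in for the output of the transform step.
rows = [
    {"id": 1, "first_name": "Abhishek", "last_name": "Andhavarapu"},
    {"id": 2, "first_name": "Abhi", "last_name": "Andhavarapu"},
]
body = bulk_body("person", rows)
print(body)
```

The resulting string would be sent as the body of a POST to the `_bulk` endpoint; sending it is left out so the sketch runs without a cluster.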
  18. Current Elasticsearch Cluster
      • 9 bare metal boxes
      • 128 GB RAM
      • 2X SSD
      • 10 GB Ethernet
      • 2X 10 core Xeon Processors
      • 2X 30GB Elasticsearch instances per box
      • 1 Elasticsearch load balancing instance to handle index requests
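The numbers on this slide add up as follows. The arithmetic assumes that the 128 GB of RAM is per box and that "30GB" is the JVM heap of each Elasticsearch instance; the load balancing instance is left out because the slide does not give its size.

```python
boxes = 9
ram_per_box_gb = 128        # assumed to be per box
instances_per_box = 2
heap_per_instance_gb = 30   # assumed to be the JVM heap size

data_instances = boxes * instances_per_box              # 18 instances
total_heap_gb = data_instances * heap_per_instance_gb   # 540 GB of heap
total_ram_gb = boxes * ram_per_box_gb                   # 1152 GB of RAM
# RAM per box that is not claimed by the two heaps.
ram_left_per_box_gb = ram_per_box_gb - instances_per_box * heap_per_instance_gb

print(data_instances, total_heap_gb, total_ram_gb, ram_left_per_box_gb)
```

Under these assumptions each box keeps 68 GB outside the two heaps, which the operating system can use for caching files on disk.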
  19. Zabbix: What's slow? Any request that takes more than 300 ms is slow.
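The 300 ms rule is simple to express in code. The sketch below is only an illustration of the threshold; the request names and timings are invented, and it says nothing about how Zabbix was actually configured.

```python
SLOW_MS = 300  # threshold from the slide


def slow_requests(timings_ms, threshold_ms=SLOW_MS):
    """Return the (name, duration) pairs that took more than the threshold."""
    return [(name, ms) for name, ms in timings_ms if ms > threshold_ms]


# Invented sample timings in milliseconds.
sample = [("q1", 120), ("q2", 300), ("q3", 450)]
slow = slow_requests(sample)
print(slow)  # [('q3', 450)]
```

A request of exactly 300 ms is not flagged, because the rule is "more than 300 ms".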
  20. Lessons Learnt
  21. Concurrency
      • More replication for more concurrency. Updates are costly.
      • More shards, much faster.
      • SQL 3 to 5k per minute
  22. Filter Cache
      • All filters have a cache flag that controls whether they are cached.
      • Once the filter cache is warmed, all requests are served from memory.
      • Default: 10% for the filter cache.
      • LRU.
      • Bit sets.
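The combination of bit sets and LRU eviction can be sketched as below. This is a toy model, not Elasticsearch's implementation: the `FilterCache` class is hypothetical, a Python integer stands in for the bit set, and the cache is bounded by entry count here rather than by a share of memory.

```python
from collections import OrderedDict


class FilterCache:
    """Toy LRU cache mapping a filter key to a bit set of matching documents."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._cache = OrderedDict()

    def bitset(self, key, docs, predicate):
        if key in self._cache:
            # Cache hit: mark as most recently used and skip the scan.
            self._cache.move_to_end(key)
            return self._cache[key]
        bits = 0
        for i, doc in enumerate(docs):
            if predicate(doc):
                bits |= 1 << i  # bit i is set when document i matches
        self._cache[key] = bits
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict the least recently used
        return bits


# Invented sample documents.
docs = [
    {"lang": "en", "year": 2014},
    {"lang": "de", "year": 2014},
    {"lang": "en", "year": 2013},
]
cache = FilterCache(max_entries=2)
english = cache.bitset("lang:en", docs, lambda d: d["lang"] == "en")
y2014 = cache.bitset("year:2014", docs, lambda d: d["year"] == 2014)
both = english & y2014  # combining two cached filters is a bitwise AND
print(bin(english), bin(y2014), bin(both))
```

Once a filter's bit set is cached, combining it with another filter is a cheap bitwise operation, which is the point of warming the cache.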
  23. Field Data
      • For sorting, aggregation, etc., all the field values are loaded into memory; this is called field data.
      • By default it's unbounded.
      • It is expensive to build, so it's recommended to hold it in memory.
      • There are circuit breakers to protect against this.
      • If a query is going to use more than 60% of the JVM heap, the circuit breaker kills the query.
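The circuit breaker idea can be sketched as a check made before field data is loaded. This is a simplified illustration, assuming the 60% limit from this slide and the 30 GB heap from the cluster slide; the class and exception names are hypothetical and do not mirror Elasticsearch internals.

```python
class CircuitBreakingError(Exception):
    """Raised when loading more field data would exceed the limit."""


class FieldDataBreaker:
    def __init__(self, heap_bytes, limit_fraction=0.60):
        self.limit_bytes = int(heap_bytes * limit_fraction)
        self.used_bytes = 0

    def reserve(self, estimated_bytes):
        """Account for a load up front; refuse it if it would cross the limit."""
        if self.used_bytes + estimated_bytes > self.limit_bytes:
            raise CircuitBreakingError(
                f"would use {self.used_bytes + estimated_bytes} bytes, "
                f"limit is {self.limit_bytes}"
            )
        self.used_bytes += estimated_bytes


GB = 1024 ** 3
breaker = FieldDataBreaker(heap_bytes=30 * GB)  # limit is about 18 GB
breaker.reserve(10 * GB)                        # fits under the limit
try:
    breaker.reserve(10 * GB)                    # 20 GB in total, over the limit
    rejected = False
except CircuitBreakingError:
    rejected = True
print(rejected)  # True
```

The request is refused before any memory is used, so the query fails but the node stays up.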
  24. JVM memory - Friend or Foe? to replicate which are still serving requests causing additional heap
  25. Getting Bad. Solution? More memory. Not necessarily more boxes.
  26. Elasticsearch Cons
      • Not commodity hardware: 6K (Hadoop) vs 10K (SSD).
      • GC issues.
      • Circuit breakers don't protect you against everything.
      • No built-in security. Use an nginx proxy with authentication.
      • Learning curve.
      • Lots of updates hurt: the filter cache has to be rebuilt, merges are triggered, etc.
  27. Thank you
      • abhishek376.wordpress.com
      • abhishek376@gmail.com
      • Twitter: abhishek376
      We are Hiring !!
