Who doesn't love building highly available, scalable systems that hold multiple terabytes of data? We recently had the pleasure of cracking some tough nuts along the way, and we'd love to share with the community what we learned while designing, building, and operating a 120-node, 6TB Elasticsearch (and Hadoop) cluster.
• Shard allocation
• Avoid rebalancing (discovery timeout) (see the sketch after this list)
• Uncached facets: https://github.com/lovelysystems/elasticsearch-ls-plugins
• LUCENE-2205: rework of the TermInfosReader class to remove the Terms, TermInfos, and the index pointer long, and create a more memory-efficient data structure.
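A minimal sketch of how shard allocation can be tuned over the cluster settings API, assuming a `requests`-based Python client against a local node. The settings shown (`cluster.routing.allocation.awareness.attributes`, `cluster.routing.allocation.cluster_concurrent_rebalance`) are real Elasticsearch settings, but their names and defaults vary between versions, and the values here are illustrative, not taken from the talk. The discovery timeout itself (`discovery.zen.ping.timeout` in the 0.x/1.x era) is a static node setting that belongs in each node's elasticsearch.yml rather than in this API call.

```python
import json
import requests

ES = "http://localhost:9200"

settings = {
    "persistent": {
        # Zone-aware shard allocation: spread primary and replica
        # copies across the custom "zone" attribute that each node
        # declares in its elasticsearch.yml.
        "cluster.routing.allocation.awareness.attributes": "zone",
        # Throttle concurrent rebalancing so a slow or briefly
        # unreachable node does not trigger a storm of shard
        # movements across a multi-terabyte cluster.
        "cluster.routing.allocation.cluster_concurrent_rebalance": 2,
    }
}

resp = requests.put(
    f"{ES}/_cluster/settings",
    headers={"Content-Type": "application/json"},
    data=json.dumps(settings),
)
resp.raise_for_status()
print(resp.json())
```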
• 3 AP servers / MC: c1.xlarge
• 6 ES master nodes
• 6-node Hadoop cluster: c1.xlarge + spot instances
• 40 ES nodes per zone: m1.large, 8 EBS volumes
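The split between dedicated master nodes and data nodes maps onto per-node Elasticsearch settings. Below is a hypothetical helper that renders elasticsearch.yml fragments for the two roles; `node.master`, `node.data`, and custom node attributes such as `node.zone` are real settings from that era, while the role profiles and zone name are assumptions for illustration.

```python
# Hypothetical config renderer; the role split mirrors the slide's
# "6 ES master nodes" vs. "40 ES nodes per zone" layout.
ROLE_PROFILES = {
    # Dedicated masters: hold cluster state, never shard data.
    "master": {"node.master": "true", "node.data": "false"},
    # Data nodes: hold the shards, never eligible as master.
    "data": {"node.master": "false", "node.data": "true"},
}

def render_config(role: str, zone: str) -> str:
    """Return an elasticsearch.yml fragment for one node."""
    lines = [f"{key}: {value}" for key, value in ROLE_PROFILES[role].items()]
    # The zone attribute feeds the allocation awareness setting above.
    lines.append(f"node.zone: {zone}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(render_config("data", "eu-west-1a"))  # zone name is made up
```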
Cutting the cost
• Reduce the amount of data: use Hadoop/MapReduce transforms to eliminate spam, irrelevant languages, ... (a filter sketch follows this list)
• No more time-based indices
• Dedicated hardware
• SSD disks
• Share hardware for ES and Hadoop
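As a concrete sketch of that data-reduction step, a Hadoop Streaming mapper can drop spam and unwanted-language documents before anything reaches Elasticsearch. The field names (`lang`, `spam_score`), the language whitelist, and the threshold are all assumptions for illustration, not details from the talk.

```python
#!/usr/bin/env python
"""Hadoop Streaming mapper: keep only documents worth indexing.

Run roughly as:
  hadoop jar hadoop-streaming.jar \
      -input raw_docs -output clean_docs \
      -mapper filter_docs.py
"""
import json
import sys

KEEP_LANGUAGES = {"en", "de"}   # assumption: only these languages are indexed
SPAM_THRESHOLD = 0.8            # assumption: docs scoring above this are spam

def keep(doc):
    """Return True if the document should survive the cleaning pass."""
    if doc.get("lang") not in KEEP_LANGUAGES:
        return False
    if doc.get("spam_score", 0.0) >= SPAM_THRESHOLD:
        return False
    return True

for line in sys.stdin:
    try:
        doc = json.loads(line)
    except ValueError:
        continue  # drop unparseable records instead of failing the job
    if keep(doc):
        sys.stdout.write(json.dumps(doc) + "\n")
```

Filtering this way shrinks the index once, at ingest time, which is what makes the later savings (fewer nodes, no time-based indices) possible at all.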