ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch

ES-Hadoop: Bridging the world of Hadoop and
Elasticsearch
Bala Venkatrao (bala@elastic.co)
June 2015

h"ps://github.com/elas3c/elas3csearch-‐hadoop

www.elastic.co
Elasticsearch for Apache Hadoop™
3

www.elastic.co
Certified to work
4

Partition-to-Partition architecture
Node1
2P
1R
Node2
1P
3R
Node3
2R
3P

Dynamic runtime matching
Node1

2P

1R

Node2

1P

3R

Node3

2R

3P

Failure handling
Node1

2P

1R

Node2

1P

3R

Node3

2R

3P

Co-location
Node1

2P

1R

Node2

1P

3R

Node3

2R

3P

www.elastic.co
Native integration - Map / Reduce
JobConf
conf
=
new
JobConf();

conf.setInputFormat(EsInputFormat.class);

conf.set("es.resource",
"radio/artists");

conf.set("es.query",
"?q=me*");

JobClient.runJob(conf);

JobConf
conf
=
new
JobConf();

conf.setOutputFormat(EsOutputFormat.class);

conf.set("es.resource",
"radio/artists");


9

www.elastic.co
Native integration - Cascading
Tap
in
=
new
EsTap("radio/artists","?q=me*");

Tap
out
=
new
StdOut(new
TextLine());

new
LocalFlowConnector().

connect(in,
out,
new
Pipe(“pipe")).complete();


Tap
in
=
Lfs(new
TextDelimited(

new
Fields("id",
"name",
"url",
"picture")),
"artists.dat");

Tap
out
=
new
EsTap("radio/artists",

new
Fields("name",
"url",
"picture"));

new
HadoopFlowConnector().

connect(in,
out,
new
Pipe(“pipe")).complete();

10

www.elastic.co
Native integration - Apache Pig
A
=
LOAD
'radio/artists'
USING

org.elasticsearch.hadoop.pig.EsStorage('es.query=?q=me*');

DUMP
A;

A
=
LOAD
'src/artists.dat'
USING
PigStorage()
AS

(id:long,
name,
url:chararray,
picture:
chararray);

B
=
FOREACH
A
GENERATE
name,
TOTUPLE(url,
picture)
AS
links;

STORE
B
INTO
'radio/artists'
USING

org.elasticsearch.hadoop.pig.EsStorage();

11

www.elastic.co
Native integration - Apache Hive
CREATE
EXTERNAL
TABLE
artists
(

id
BIGINT,name
STRING,
links
STRUCT<url:STRING,
picture:STRING>)

STORED
BY
'org.elasticsearch.hadoop.hive.EsStorageHandler'

TBLPROPERTIES('es.resource'='radio/artists','es.query'='?q=me*');

SELECT
FROM
artists;

CREATE
EXTERNAL
TABLE
artists
(

id
BIGINT,name
STRING,
links
STRUCT<url:STRING,
picture:STRING>)

STORED
BY
'org.elasticsearch.hadoop.hive.EsStorageHandler'

TBLPROPERTIES('es.resource'='radio/artists');

INSERT
OVERWRITE
TABLE
artists
SELECT

s.name,
named_struct('url',
s.url,
'picture',
s.pic)
FROM
source

s;

12

www.elastic.co
Native integration - Apache Spark
import
org.elasticsearch.spark._

val
sc
=
new
SparkContext(new
SparkConf())

val
rdd
=
sc.esRDD("radio/artists",
"?me*")

import
org.elasticsearch.spark._

case
class
Artist(name:
String,
albums:
Int)

val
u2
=
Artist("U2",
12)

val
bh
=
Map("name"-‐>"Buckethead","albums"
-‐>
95,
"age"
-‐>
45)

sc.makeRDD(Seq(u2,
h2)).saveToEs("radio/artists")

13

www.elastic.co
Native integration - Spark SQL
val
sql
=
new
SQLContext...

val
df
=
sql.load("radio/artists",
"org.elasticsearch.spark.sql")

df.filter(df("age")
>
40)

val
sql
=
new
SQLContext...

val
table
=
sql.sql("CREATE
TEMPORARY
TABLE
artists
"
+

"USING
org.elasticsearch.spark.sql
"
+

"OPTIONS(resource=`radio/artists`)
")

val
names
=
sql.sql("SELECT
name
FROM
artists")

14

www.elastic.co
Native integration - Apache Storm
TopologyBuilder
builder
=
new
TopologyBuilder();

builder.setBolt("esBolt",
new
EsBolt("twitter/tweets"));

TopologyBuilder
builder
=
new
TopologyBuilder();

builder.setSpout("esSpout",new
EsSpout("twitter/tweets","?q=nfl*",
5);

Builder.setBolt("bolt“,
new

PrinterBolt()).shuffleGrouping("esSpout");

15

www.elastic.co
Resource Management

www.elastic.co
YARN support – In Beta
•  Run Elasticsearch on YARN
•  But YARN doesn’t support long-lived services (yet):
•  No provisioning
•  No ip/network guarantees
•  Data/node affinity
•  Next YARN releases plan to address this
•  Tracking projects like Apache Slider
17

www.elastic.co
HDFS integration
•  Snapshot/Restore
•  Use HDFS as a shared storage
•  Backup and recover data
•  Works great with snapshot immutable data
•  HDFS as a File-System – not recommended / tread carefully
•  Incomplete FS semantics (last-delete-on-close, fsync)
•  NFSv3 (metadata issues)
•  See Elasticsearch issue #9072

19

www.elastic.co
20

•  Support for Spark, Spark SQL, Storm
•  Includes support for Spark (core and SQL) 1.2, 1.3 and 1.4
•  Support for all Spark SQL filters and relationship traits
•  Certification with Hadoop distributions
•  Currently certified with CDH5.x, HDP2.x, MapR 4.x and Databricks Spark
•  Security enhancements
•  Basic HTTP authentication allowing Hadoop jobs running against a restricted Elasticsearch cluster to identify themselves
accordingly
•  SSL/TLS support for cryptographic connections between Elasticsearch and Hadoop cluster, enabling data-sensitive
environments to transparently encrypt the data at transport level and thus prevent snooping and preserve data
confidentiality.
•  Support for Shield-enabled Elasticsearch clusters
•  Several enhancements and performance improvements, including
•  Client node routing
•  Return raw JSON and metadata while reading documents from ES
•  Inclusion / Exclusion of fields to be written to ES
What’s New in ES-Hadoop 2.1

www.elastic.co
•  Support for ES aggregations
•  Marvel integration
•  Integration with Machine Learning libraries e.g Mllib
•  Others? (Suggestions)
Roadmap
21

www.elastic.co
22

Documentation –
https://www.elastic.co/guide/en/elasticsearch/hadoop/index.html
Project home page/ Source repository -
https://github.com/elastic/elasticsearch-hadoop
Issue tracker - https://github.com/elastic/elasticsearch-hadoop/issues
Mailing list / forum - https://discuss.elastic.co/c/elasticsearch-and-hadoop
More Questions?

ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch

Similar to ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch (20)

More from MapR Technologies

More from MapR Technologies (20)

ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch