In this talk, we will provide an overview of Elasticsearch for Apache Hadoop (ES-Hadoop), which integrates Elasticsearch with the various Hadoop libraries, whether batch (Map/Reduce, Pig, Hive) or stream oriented (such as Apache Spark). We will also cover the YARN support and the HDFS snapshot/restore plugin available as part of ES-Hadoop, and close with the upcoming ES-Hadoop 2.1 GA release and the near-term roadmap.
10. www.elastic.co
Native integration - Cascading
// Read: query Elasticsearch and print the matching documents
Tap in = new EsTap("radio/artists", "?q=me*");
Tap out = new StdOut(new TextLine());
new LocalFlowConnector().connect(in, out, new Pipe("pipe")).complete();
JobClient.runJob(conf);
// Write: load a delimited file and index selected fields into Elasticsearch
Tap in = new Lfs(new TextDelimited(new Fields("id", "name", "url", "picture")), "artists.dat");
Tap out = new EsTap("radio/artists", new Fields("name", "url", "picture"));
new HadoopFlowConnector().connect(in, out, new Pipe("pipe")).complete();
11. www.elastic.co
Native integration - Apache Pig
-- Read: load the documents matching the query and print them
A = LOAD 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage('es.query=?q=me*');
DUMP A;

-- Write: reshape the input and store it in Elasticsearch
A = LOAD 'src/artists.dat' USING PigStorage() AS (id:long, name, url:chararray, picture:chararray);
B = FOREACH A GENERATE name, TOTUPLE(url, picture) AS links;
STORE B INTO 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage();
12. www.elastic.co
Native integration - Apache Hive
-- Read: expose the documents matching the query as a Hive table
CREATE EXTERNAL TABLE artists (
    id BIGINT,
    name STRING,
    links STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource'='radio/artists', 'es.query'='?q=me*');
SELECT * FROM artists;

-- Write: rows inserted into the table are indexed in Elasticsearch
CREATE EXTERNAL TABLE artists (
    id BIGINT,
    name STRING,
    links STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource'='radio/artists');
INSERT OVERWRITE TABLE artists
SELECT s.name, named_struct('url', s.url, 'picture', s.pic)
FROM source s;
13. www.elastic.co
Native integration - Apache Spark
// Read: load the documents matching the query as an RDD
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

val sc = new SparkContext(new SparkConf())
val rdd = sc.esRDD("radio/artists", "?q=me*")

// Write: index case classes and maps alike
import org.elasticsearch.spark._

case class Artist(name: String, albums: Int)
val u2 = Artist("U2", 12)
val bh = Map("name" -> "Buckethead", "albums" -> 95, "age" -> 45)
sc.makeRDD(Seq(u2, bh)).saveToEs("radio/artists")
14. www.elastic.co
Native integration - Spark SQL
// Read through the DataFrame API
val sql = new SQLContext...
val df = sql.load("radio/artists", "org.elasticsearch.spark.sql")
df.filter(df("age") > 40)

// Read through a temporary table declared in Spark SQL
val sql = new SQLContext...
val table = sql.sql("CREATE TEMPORARY TABLE artists " +
    "USING org.elasticsearch.spark.sql " +
    "OPTIONS (resource 'radio/artists')")
val names = sql.sql("SELECT name FROM artists")
15. www.elastic.co
Native integration - Apache Storm
// Write: index incoming tuples into Elasticsearch
TopologyBuilder builder = new TopologyBuilder();
builder.setBolt("esBolt", new EsBolt("twitter/tweets"));

// Read: stream the documents matching the query into the topology
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("esSpout", new EsSpout("twitter/tweets", "?q=nfl*"), 5);
builder.setBolt("bolt", new PrinterBolt()).shuffleGrouping("esSpout");
17. www.elastic.co
YARN support – In Beta
• Run Elasticsearch on YARN
• But YARN doesn’t support long-lived services (yet):
  • No provisioning
  • No IP/network guarantees
  • No data/node affinity
• Upcoming YARN releases plan to address this
• Tracking projects like Apache Slider
19. www.elastic.co
HDFS integration
• Snapshot/Restore
  • Use HDFS as shared storage
  • Back up and recover data
  • Works well with immutable snapshot data
• HDFS as a file system: not recommended, tread carefully
  • Incomplete FS semantics (delete-on-last-close, fsync)
  • NFSv3 (metadata issues)
  • See Elasticsearch issue #9072
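As a rough illustration of the Snapshot/Restore bullets on this slide, the sketch below builds the request that registers an HDFS-backed snapshot repository through Elasticsearch's `_snapshot` API. The repository name `hdfs_backup`, the namenode address and the path are hypothetical placeholders, and the code only prints the request rather than sending it.

```java
class HdfsRepositoryRequest {

    // Builds the JSON body for a repository of type "hdfs".
    // "uri" and "path" are settings of the HDFS repository plugin;
    // the values are placeholders supplied by the caller.
    public static String body(String hdfsUri, String path) {
        return "{\"type\":\"hdfs\",\"settings\":{"
                + "\"uri\":\"" + hdfsUri + "\","
                + "\"path\":\"" + path + "\"}}";
    }

    // Endpoint that registers (or updates) a snapshot repository.
    public static String endpoint(String repositoryName) {
        return "/_snapshot/" + repositoryName;
    }

    public static void main(String[] args) {
        // Hypothetical names, for illustration only.
        System.out.println("PUT " + endpoint("hdfs_backup"));
        System.out.println(body("hdfs://namenode:8020", "elasticsearch/snapshots"));
    }
}
```

Once a repository like this is registered, snapshots are created and restored against it with the regular snapshot API, so HDFS is used only as backup storage and not as the live index file system.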
20. www.elastic.co
What’s New in ES-Hadoop 2.1
• Support for Spark, Spark SQL, Storm
  • Includes support for Spark (core and SQL) 1.2, 1.3 and 1.4
  • Support for all Spark SQL filters and relationship traits
• Certification with Hadoop distributions
  • Currently certified with CDH5.x, HDP2.x, MapR 4.x and Databricks Spark
• Security enhancements
  • Basic HTTP authentication, allowing Hadoop jobs running against a restricted Elasticsearch cluster to identify themselves
  • SSL/TLS support for encrypted connections between Elasticsearch and the Hadoop cluster, so data-sensitive environments can transparently encrypt data at the transport level, preventing snooping and preserving confidentiality
  • Support for Shield-enabled Elasticsearch clusters
• Several enhancements and performance improvements, including
  • Client node routing
  • Returning raw JSON and metadata while reading documents from ES
  • Inclusion / exclusion of fields to be written to ES
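The security and read/write features listed on this slide are switched on through `es.*` configuration settings. The sketch below collects a few of them in a plain `java.util.Properties` object; in a real job they would be set on the Hadoop `Configuration`, the `SparkConf`, or the equivalent for each integration. The host name, credentials and excluded field are hypothetical placeholders, and the setting names should be checked against the ES-Hadoop configuration reference for the version in use.

```java
import java.util.Properties;

class EsHadoopSettings {

    // Settings for a job that talks to a restricted, TLS-enabled cluster.
    public static Properties secured(String nodes, String user, String password) {
        Properties p = new Properties();
        p.setProperty("es.nodes", nodes);
        p.setProperty("es.net.http.auth.user", user);     // basic HTTP authentication
        p.setProperty("es.net.http.auth.pass", password);
        p.setProperty("es.net.ssl", "true");              // encrypt the connection
        return p;
    }

    // Copies the base settings and adds the read/write options from the slide.
    public static Properties withReadWriteOptions(Properties base, String excludedField) {
        Properties p = new Properties();
        p.putAll(base);
        p.setProperty("es.nodes.client.only", "true");    // route through client nodes
        p.setProperty("es.read.metadata", "true");        // return document metadata
        p.setProperty("es.output.json", "true");          // return documents as raw JSON
        p.setProperty("es.mapping.exclude", excludedField); // leave this field out on write
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical host, credentials and field name.
        Properties base = secured("es-host:9200", "hadoop_job", "secret");
        withReadWriteOptions(base, "picture").list(System.out);
    }
}
```

Keeping the security settings separate from the read/write options makes it easy to reuse the same credentials across jobs that read and write different fields.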
21. www.elastic.co
Roadmap
• Support for ES aggregations
• Marvel integration
• Integration with machine learning libraries, e.g. MLlib
• Others? (Suggestions)