2017 02-07 - elastic & spark. building a search geo locator

Rome, 7 February 2017
presented by Alberto Paro, Seacom
Elastic & Spark.
Building A Search Geo Locator

Alberto Paro
• Degree in Computer Engineering (POLIMI)
• Author of 3 books on ElasticSearch, covering versions 1 to 5.x, plus 6 tech reviews
• I work mainly in Scala and on Big Data technologies (Akka, Spray.io, Playframework, Apache Spark) and NoSQL (Accumulo, Cassandra, ElasticSearch and MongoDB)
• Evangelist for the Scala and Scala.JS languages

Elasticsearch 5.x - Cookbook
• Choose the best ElasticSearch cloud topology to deploy and power it up with external plugins
• Develop tailored mapping to take full control of index steps
• Build complex queries through managing indices and documents
• Optimize search results through executing analytics aggregations
• Monitor the performance of the cluster and nodes
• Install Kibana to monitor cluster and extend Kibana for plugins.
• Integrate ElasticSearch in Java, Scala, Python and Big Data applications
Discount code for Ebook: ALPOEB50
Discount code for Print Book: ALPOPR15
Expiration Date: 21st Feb 2017

Goals
• Big Data architectures with ES
• Apache Spark
• GeoIngester
• Data Collection
• Index optimization
• Ingestion via Apache Spark
• Searching for a place
• Overview of Big Data tools

Hadoop / Spark
[Figure: an iterative job in Hadoop MapReduce (Input, HDFS read, Iter 1, HDFS write, HDFS read, Iter 2, HDFS write) compared with the same job in Apache Spark (Input, Iter 1, Iter 2).]
Evolution of the MapReduce model

Apache Spark
• Written in Scala, with APIs in Java, Python and R
• Evolution of the Map/Reduce model
• Powerful bundled modules:
  • Spark SQL
  • Spark Streaming
  • MLlib (Machine Learning)
  • GraphX (graph)

Geoname
GeoNames is a geographical database, freely downloadable under a Creative Commons license.
It contains about 10 million geographical names and consists of about 9 million unique features, of which 2.8 million are populated places and 5.5 million are alternate names.
It can easily be downloaded from http://download.geonames.org/export/dump as a CSV file (tab-delimited).
The code is available at:
https://github.com/aparo/elasticsearch-geonames-locator

Geoname - Structure
No.  Attribute name   Explanation
1    geonameid        Unique ID for this geoname
2    name             The name of the geoname
3    asciiname        ASCII representation of the name
4    alternatenames   Other forms of this name, generally in several languages
5    latitude         Latitude in decimal degrees of the geoname
6    longitude        Longitude in decimal degrees of the geoname
7    fclass           Feature class, see http://www.geonames.org/export/codes.html
8    fcode            Feature code, see http://www.geonames.org/export/codes.html
9    country          ISO-3166 2-letter country code
10   cc2              Alternate country codes, comma separated, ISO-3166 2-letter country code
11   admin1           Fipscode (subject to change to iso code)
12   admin2           Code for the second administrative division, a county in the US
13   admin3           Code for third level administrative division
14   admin4           Code for fourth level administrative division
15   population       The population of the geoname
16   elevation        The elevation in meters of the geoname
17   gtopo30          Digital elevation model
18   timezone         The timezone of the geoname
19   moddate          The date of last change of this geoname
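
To make the column layout concrete, here is a minimal Scala sketch that splits one tab-delimited line and keeps a few of the fields listed above. `GeonameLine`, `parseLine` and the sample record are hypothetical names and values used only for illustration; they are not taken from the talk's repository.

```scala
// Hypothetical helper, not part of the talk's code: parses one tab-delimited
// GeoNames line and keeps a handful of the 19 columns listed above.
object GeonameLineExample {
  final case class GeonameLine(
    geonameid: Int,
    name: String,
    alternatenames: List[String],
    latitude: Double,
    longitude: Double,
    country: String
  )

  def parseLine(line: String): Option[GeonameLine] = {
    // limit -1 keeps trailing empty columns instead of dropping them
    val cols = line.split("\t", -1)
    if (cols.length < 19) None
    else scala.util.Try {
      GeonameLine(
        geonameid = cols(0).toInt,
        name = cols(1),
        // column 4 (index 3) is a comma-separated list of alternate names
        alternatenames = cols(3).split(",").map(_.trim).filter(_.nonEmpty).toList,
        latitude = cols(4).toDouble,
        longitude = cols(5).toDouble,
        country = cols(8)
      )
    }.toOption // malformed numbers yield None instead of an exception
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample record with 19 columns (values are illustrative only)
    val sample = List("1", "Sample Town", "Sample Town", "Sampletown, Sample City",
      "41.9", "12.5", "P", "PPL", "IT", "", "", "", "", "", "1000", "", "20",
      "Europe/Rome", "2017-02-07").mkString("\t")
    println(parseLine(sample))
  }
}
```

Returning `Option` keeps short or malformed lines from aborting a whole run; whether to drop or log them is a design choice left open here.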

Index optimization - 1/2
Needed to:
• Remove fields that are not required.
• Handle Geo Point fields.
• Optimize string fields (text, keyword)
• Set the right number of shards (11M records => 2 shards)
Benefits => performance/disk space/CPU
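
As an illustration of the shard-count point, the sketch below builds the `settings` part of an index-creation request body. `number_of_shards` and `number_of_replicas` are standard Elasticsearch index settings; the value 2 follows the slide, while the replica count and the names `IndexSettingsExample` and `indexSettings` are assumptions made for this example.

```scala
object IndexSettingsExample {
  // Builds the settings body for an index-creation request.
  // shards = 2 follows the slide; replicas = 1 is an assumption.
  def indexSettings(shards: Int = 2, replicas: Int = 1): String =
    s"""{
       |  "settings": {
       |    "number_of_shards": $shards,
       |    "number_of_replicas": $replicas
       |  }
       |}""".stripMargin

  def main(args: Array[String]): Unit =
    println(indexSettings())
}
```

The number of primary shards is fixed when the index is created, so it is worth choosing before the ingestion job runs.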

Index optimization - 2/2
{
  "mappings": {
    "geoname": {
      "properties": {
        "admin1": {
          "type": "keyword",
          "ignore_above": 256
        },
        …
        "alternatenames": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        …
        …
        "location": {
          "type": "geo_point"
        },
        …
        "longitude": {
          "type": "float"
        },
        "moddate": {
          "type": "date"
        },
        …
      }
    }
  }
}

Ingestion via Spark - GeonameIngester - 1/7
Our ingester will perform the following steps:
• Initializing the Spark job
• Parsing the CSV
• Defining the indexing structure
• Populating the classes
• Writing the data to Elasticsearch
• Running the Spark job

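Initializing a Spark job
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.elasticsearch.spark.rdd.EsSpark
import scala.util.Try

object GeonameIngester {
  def main(args: Array[String]): Unit = {
    // Local single-node session; the body of main continues on the following slides
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("GeonameIngester")
      .getOrCreate()
    // Brings Dataset encoders into scope (required by geonames.map in Spark 2.x)
    import sparkSession.implicits._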

Parsing the CSV
val geonameSchema = StructType(Array(
  StructField("geonameid", IntegerType, false),
  StructField("name", StringType, false),
  StructField("asciiname", StringType, true),
  StructField("alternatenames", StringType, true),
  StructField("latitude", FloatType, true), ….
val GEONAME_PATH = "downloads/allCountries.txt"
// The dump is tab-delimited, has no header row and uses no quoting
val geonames = sparkSession.read
  .option("header", false)
  .option("quote", "")
  .option("delimiter", "\t")
  .option("maxColumns", 22)
  .schema(geonameSchema)
  .csv(GEONAME_PATH)
  .cache()

Defining our classes for indexing
case class GeoPoint(lat: Double, lon: Double)

case class Geoname(
  geonameid: Int, name: String, asciiname: String, alternatenames: List[String],
  latitude: Float, longitude: Float, location: GeoPoint,
  fclass: String, fcode: String, country: String, cc2: String,
  admin1: Option[String], admin2: Option[String], admin3: Option[String], admin4: Option[String],
  population: Double, elevation: Int, gtopo30: Int, timezone: String, moddate: String)

// Implicitly converts null, empty or blank strings to None, anything else to Some(trimmed)
implicit def emptyToOption(value: String): Option[String] =
  Option(value).map(_.trim).filter(_.nonEmpty)
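
A small, self-contained sketch of how such an implicit conversion behaves. The `Admin` class and the sample values are hypothetical and exist only for this illustration.

```scala
import scala.language.implicitConversions

object EmptyToOptionExample {
  // Same behaviour as the converter on the slide: null, empty and blank
  // strings become None, everything else becomes Some(trimmed value).
  implicit def emptyToOption(value: String): Option[String] =
    Option(value).map(_.trim).filter(_.nonEmpty)

  // Hypothetical class used only for this illustration
  final case class Admin(admin1: Option[String], admin2: Option[String])

  def main(args: Array[String]): Unit = {
    val missing: String = null
    // Plain Strings are accepted where Option[String] is expected
    val admin = Admin(" 07 ", missing)
    println(admin) // Admin(Some(07),None)
  }
}
```

The conversion only applies to values typed as `String`: a bare `null` literal already conforms to `Option[String]` and would be passed through unchanged, which is why the example uses a typed `val`.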

Populating our classes
val records = geonames.map { row =>
  val id = row.getInt(0)
  val lat = row.getFloat(4)
  val lon = row.getFloat(5)
  Geoname(
    id, row.getString(1), row.getString(2),
    // alternatenames: comma-separated string split into a clean list
    Option(row.getString(3)).map(_.split(",").map(_.trim).filterNot(_.isEmpty).toList).getOrElse(Nil),
    lat, lon, GeoPoint(lat, lon),
    row.getString(6), row.getString(7), row.getString(8), row.getString(9),
    // admin1..admin4 become Option[String] through the implicit emptyToOption
    row.getString(10), row.getString(11), row.getString(12), row.getString(13),
    row.getDouble(14), fixNullInt(row.get(15)), row.getInt(16), row.getString(17),
    row.getDate(18).toString
  )
}
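
The slide calls `fixNullInt` without showing its definition. The sketch below is one possible implementation, not the author's: it assumes the helper exists to guard against null values in the elevation column and that a missing or non-numeric value should map to 0.

```scala
object FixNullIntExample {
  // Hypothetical implementation: the slides call fixNullInt but do not show it.
  // Assumption: a missing or non-numeric value is mapped to 0.
  def fixNullInt(value: Any): Int = value match {
    case null                => 0
    case i: Int              => i
    case n: java.lang.Number => n.intValue
    case other               => scala.util.Try(other.toString.trim.toInt).getOrElse(0)
  }

  def main(args: Array[String]): Unit = {
    println(fixNullInt(null)) // 0
    println(fixNullInt(42))   // 42
    println(fixNullInt("7"))  // 7
  }
}
```

Mapping a missing elevation to 0 loses the distinction between "unknown" and "sea level"; an `Option[Int]` field would preserve it at the cost of changing the `Geoname` class.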

Writing to Elasticsearch
// geonameid is used as the Elasticsearch document id
EsSpark.saveToEs(records.toJavaRDD, "geonames/geoname", Map("es.mapping.id" -> "geonameid"))

Running a Spark job
spark-submit --class GeonameIngester target/scala-2.11/elasticsearch-geonames-locator-assembly-1.0.jar
(~20 minutes on a single machine)

Searching for a place
curl -XPOST 'http://localhost:9200/geonames/geoname/_search' -d '{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        { "term": { "name": "moscow"}},
        { "term": { "alternatenames": "moscow"}},
        { "term": { "asciiname": "moscow" }}
      ],
      "filter": [
        { "term": { "fclass": "P" }},
        { "range": { "population": {"gt": 0}}}
      ]
    }
  },
  "sort": [ { "population": { "order": "desc"}}]
}'
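
The same request body can be generated for any place name. The sketch below is a minimal, hypothetical helper (`PlaceQueryExample` and `placeQuery` are not part of the talk's code). It lowercases the input on the assumption that the name fields are text fields indexed with the default analyzer, since term queries do not analyze their input, and it escapes only backslashes and double quotes.

```scala
object PlaceQueryExample {
  // Minimal escaping: only backslashes and double quotes are handled
  private def escape(s: String): String =
    s.replace("\\", "\\\\").replace("\"", "\\\"")

  // Builds the search body shown on the slide for an arbitrary place name
  def placeQuery(place: String): String = {
    val term = escape(place.trim.toLowerCase)
    s"""{
       |  "query": {
       |    "bool": {
       |      "minimum_should_match": 1,
       |      "should": [
       |        { "term": { "name": "$term" }},
       |        { "term": { "alternatenames": "$term" }},
       |        { "term": { "asciiname": "$term" }}
       |      ],
       |      "filter": [
       |        { "term": { "fclass": "P" }},
       |        { "range": { "population": { "gt": 0 }}}
       |      ]
       |    }
       |  },
       |  "sort": [ { "population": { "order": "desc" }}]
       |}""".stripMargin
  }

  def main(args: Array[String]): Unit =
    println(placeQuery("Moscow"))
}
```

The filter on `fclass` "P" restricts results to populated places, and sorting by population puts the largest match first.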

NoSQL
Key-Value
• Redis
• Voldemort
• Dynomite
• Tokyo*
BigTable Clones
• Accumulo
• HBase
• Cassandra
Document
• CouchDB
• MongoDB
• ElasticSearch
GraphDB
• Neo4j
• OrientDB
• …Graph
Message Queue
• Kafka
• RabbitMQ
• ...MQ

Language - Scala vs Java
Java:
public class User {
    private String firstName;
    private String lastName;
    private String email;
    private Password password;

    public User(String firstName, String lastName,
                String email, Password password) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.email = email;
        this.password = password;
    }

    public String getFirstName() { return firstName; }
    public void setFirstName(String firstName) { this.firstName = firstName; }
    public String getLastName() { return lastName; }
    public void setLastName(String lastName) { this.lastName = lastName; }
    public String getEmail() { return email; }
    public void setEmail(String email) { this.email = email; }
    public Password getPassword() { return password; }
    public void setPassword(Password password) { this.password = password; }

    @Override public String toString() {
        return "User [email=" + email + ", firstName=" + firstName + ", lastName=" + lastName + "]";
    }

    @Override public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((email == null) ? 0 : email.hashCode());
        result = prime * result + ((firstName == null) ? 0 : firstName.hashCode());
        result = prime * result + ((lastName == null) ? 0 : lastName.hashCode());
        result = prime * result + ((password == null) ? 0 : password.hashCode());
        return result;
    }

    @Override public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        User other = (User) obj;
        if (email == null) {
            if (other.email != null)
                return false;
        } else if (!email.equals(other.email))
            return false;
        if (password == null) {
            if (other.password != null)
                return false;
        } else if (!password.equals(other.password))
            return false;
        if (firstName == null) {
            if (other.firstName != null)
                return false;
        } else if (!firstName.equals(other.firstName))
            return false;
        if (lastName == null) {
            if (other.lastName != null)
                return false;
        } else if (!lastName.equals(other.lastName))
            return false;
        return true;
    }
}

Scala:
case class User(
  var firstName: String,
  var lastName: String,
  var email: String,
  var password: Password)
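
A short sketch of what the Scala case class provides without any hand-written code. `Password` is not defined on the slide, so a hypothetical one-field stand-in is used here, and the sample values are made up.

```scala
object CaseClassExample {
  // Hypothetical stand-in for the Password type used on the slide
  final case class Password(hash: String)

  case class User(
    var firstName: String,
    var lastName: String,
    var email: String,
    var password: Password)

  def main(args: Array[String]): Unit = {
    val a = User("Ada", "Lovelace", "ada@example.com", Password("x"))
    val b = User("Ada", "Lovelace", "ada@example.com", Password("x"))
    println(a == b)                   // true: generated structural equals
    println(a.hashCode == b.hashCode) // true: generated hashCode agrees with equals
    println(a)                        // User(Ada,Lovelace,ada@example.com,Password(x))
    println(a.copy(email = "ada@example.org").email) // ada@example.org
    a.firstName = "Augusta"           // var fields also get setters
    println(a == b)                   // false after the mutation
  }
}
```

One difference from the Java version: the generated `toString` prints every field, including the password, whereas the hand-written Java `toString` leaves it out.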

Thank you for your attention
Alberto Paro
