Elasticsearch in Big Data Architectures
Seacom – Alberto Paro
#EsInADay17
Alberto Paro
 Degree in Computer Engineering (POLIMI)
 Author of 3 books on ElasticSearch, covering versions 1.x to 5.x, plus 6 tech reviews
 Works mainly in Scala and on Big Data technologies (Akka, Spray.io, Playframework, Apache Spark) and NoSQL stores (Accumulo, Cassandra, ElasticSearch and MongoDB)
 Scala and Scala.JS evangelist
ElasticSearch 5.x - Cookbook
 Choose the best cloud topology for deployment and extend it with external plugins
 Develop optimized mappings for complete control over the indexing steps
 Build complex queries through correct management of mappings and documents
 Optimize results with aggregations
 Monitor node and cluster performance
 Install Kibana for cluster monitoring and extend it with plugins
 Integrate ElasticSearch with Java, Scala, Python and Big Data applications
Ebook discount code: ALPOEB50
Print discount code: ALPOPR15
Valid until: 21 June 2017
Transforming Big Data into Value
The ‘Datafication’
 Activities
 Conversations
 Text
 Voice
 Social media
 Browser logs
 Photos
 Videos
 Sensors
 Etc.
The four Vs: Volume, Velocity, Variety, Veracity
Big Data analytics:
 Text analytics
 Sentiment analysis
 Face recognition
 Voice analytics
 Movement analytics
 Etc.
Value
Big Data Story Map
NoSQL Data Stores
Key-Value
 Redis
 Voldemort
 Dynomite
 Tokyo*
BigTable Clones
 Accumulo
 HBase
 Cassandra
Document
 CouchDB
 MongoDB
 ElasticSearch
GraphDB
 Neo4j
 OrientDB
 …Graph
Message Queue
 Kafka
 RabbitMQ
 ...MQ
NoSQL Evolution
Hadoop MR vs Apache Spark
[Diagram: evolution of the MapReduce model. Hadoop MR chains iterations through HDFS (input → iter 1 → HDFS write → HDFS read → iter 2 → ...), writing intermediate results to disk in 128 MB blocks; Spark passes intermediate results between iterations in RAM, in 2 MB blocks, making it up to 100x faster.]
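To make the contrast concrete, here is a minimal Scala sketch (mine, not from the slides) of an iterative job; the input path and the computation are hypothetical. Caching keeps the working set in memory across iterations, where Hadoop MR would re-read it from HDFS on every pass.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("iterative-demo").setMaster("local[*]"))

// Load once and cache: later iterations reuse the in-memory copy
// instead of re-reading the input from HDFS (hypothetical path).
val values = sc.textFile("hdfs:///demo/input.txt")
  .map(_.length.toDouble)
  .cache()

var threshold = 0.0
for (_ <- 1 to 10) {
  // Each pass runs against the cached RDD: no HDFS round trip.
  threshold = values.filter(_ >= threshold).mean()
}
println(s"final threshold: $threshold")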
Apache Spark
 Written in Scala, with APIs for Java, Python and R
 An evolution of the Map/Reduce model
 Functional approach to data handling (see the sketch after this list)
 Easy integration with all the main data stores
 Powerful companion modules:
 Spark SQL
 Spark Streaming
 MLlib (machine learning)
 GraphX (graphs)
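As a small illustration of the functional style (my sketch, not from the deck), the classic word count becomes a chain of transformations; sc is an existing SparkContext and the input file name is hypothetical:

// Word count as a pipeline of transformations instead of
// hand-written Mapper and Reducer classes.
val counts = sc.textFile("corpus.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // shuffles once, like a reduce phase
counts.take(10).foreach(println)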
Big Data Macro Architecture
Big Data – Standard Architecture
Writing to Elasticsearch via Spark
Spark initialization
import org.apache.spark.SparkConf
val conf = new SparkConf().setAppName(appName).setMaster(master)
conf.set("es.index.auto.create", "true")
Writing to Elasticsearch
Every Spark RDD and DStream can be written to Elasticsearch with a single call:
import org.elasticsearch.spark.rdd.EsSpark

// records is an RDD (e.g. of case classes or Maps); geonameid becomes the document _id.
EsSpark.saveToEs(records, "geonames/geoname", Map("es.mapping.id" -> "geonameid"))
Reading from Elasticsearch via Spark
Reading via query
...
import org.elasticsearch.spark._
...
val conf = ...
val sc = new SparkContext(conf)
sc.esRDD("radio/artists", "?q=me*")
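esRDD returns an RDD of (document id, source fields as a Map); a short usage sketch follows, where the 'name' field is my assumption about the index:

// Each element is (documentId, fieldsAsMap).
val artists = sc.esRDD("radio/artists", "?q=me*")
artists.take(5).foreach { case (id, doc) =>
  println(s"$id -> ${doc.getOrElse("name", "<missing>")}")  // 'name' is assumed
}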
SparkSQL query
...
val df = spark.read.format("org.elasticsearch.spark.sql").load("spark/person_id")
...
df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
 |-- username: string (nullable = true)
spark.sql(
  "CREATE TEMPORARY VIEW persons USING org.elasticsearch.spark.sql " +
  "OPTIONS (resource 'spark/person_id', scroll_size '2000')")
val over20 = spark.sql("SELECT * FROM persons WHERE age >= 20")
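For completeness, a small self-contained variant of the same flow (assumed setup: a Spark 2.x SparkSession, a local Elasticsearch, and the elasticsearch-spark connector on the classpath); predicates such as the age filter are pushed down to Elasticsearch by the connector:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("es-sql").master("local[*]")
  .config("es.nodes", "localhost:9200")  // assumed local cluster
  .getOrCreate()

// Load the index as a DataFrame and filter server-side where possible.
val people = spark.read.format("org.elasticsearch.spark.sql").load("spark/person_id")
people.filter(people("age") >= 20).select("name", "age").show()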
Questions and …
Thank you for your attention
Editor's Notes
1. Landmark #1: Explosive Market Dynamics

The purpose of Landmark #1 was to highlight the market challenges that were necessitating a different approach to integrating big data (data and analytics) into one’s business (we used cute landmarks instead of phases to keep in the spirit of the storymap). In the original blog, we discussed how organizations that don’t adapt to big data risk the following impacts to their business models:
 Profit and margin declines
 Market share losses
 Competitors innovating faster
 Missed business opportunities

We also provided some examples of how organizations could exploit big data to power their businesses, including:
 Mine social and mobile data to uncover customers’ interests, passions, associations, and affiliations
 Exploit machine data for predictive maintenance and operational optimization
 Leverage behavioral insights to create a more compelling user experience
 Integrate new big data innovations to modernize data warehouse and business intelligence environments (real-time insights, predictive analytics)
 Become a data-driven culture
 Nurture and invest in data assets
 Cultivate analytic models and insights as intellectual property

Assessment: A+. Yea, I think we got this one right. The business potential is too significant for organizations to ignore, and the Internet of Things (IoT) is only going to make data and analytics more indispensable to the future success of an organization. Also, if I were to redo the storymap, I’d definitely replace the river with a lake. For more business challenges and opportunities afforded by big data, check out these blogs: Driving Digital Business Transformation; KPMG Survey: Firms Struggle with Big Data; The Mid-market Big Data Call to Action.

Landmark #2: Business and IT Challenges

The purpose of Landmark #2 was to highlight the significant challenges that organizations faced in trying to transform their business intelligence and data warehouse environments to take advantage of the business benefits offered by big data. The chart highlighted how traditional business intelligence and data warehouse environments are going to struggle to manage and analyze new data sources because of the following challenges:
 Rigid data warehouse architectures that impede exploiting immediate business opportunities
 Retrospective analysis that reports what happened but doesn’t guide business decisions
 Social, mobile, or machine insights that are not available in an actionable manner
 Batch-oriented processes which delay access to the data for immediate analysis and action
 Brittle and labor-intensive processes to add new data sources, reports, and analytics
 Environments that were performance- and scalability-challenged as data scales to petabytes
 Business analysis limited to aggregated and sampled data views
 Analytic environments unable to handle the tsunami of new, external unstructured data sources

Assessment: C. I under-estimated the cultural challenges of moving from Business Intelligence / Data Warehouse to Data Science / Data Lake; the challenge to unlearn old approaches so that one can embrace new approaches. I also missed the growing importance of the data lake as more than just a data repository; the data lake would transform into the organization’s collaborative value creation platform that brings Business and IT stakeholders together to exploit the economic value of data and analytics. For more details on the challenges of transforming from a Business Intelligence to a Data Science mentality, check out the blogs: Dynamic Duo of Analytic Power: Business Intelligence Analyst PLUS Data Scientist; New Roles in the Big Data World; How I’ve Learned To Stop Worrying And Love The Data Lake; Data Lake Plumbers: Operationalizing the Data Lake.

Landmark #3: Big Data Business Transformation

The purpose of Landmark #3 was to provide a benchmark that helped organizations understand how effective they were in leveraging data and analytics to power their business models. The Big Data Business Model Maturity Index introduced 5 stages of measuring how effective organizations are at exploiting the business transformation potential of big data:
 Business Monitoring – deploys business intelligence to monitor on-going business performance
 Business Insights – leverages predictive analytics to uncover actionable insights buried in the detailed transactional data plus the growing wealth of internal and publicly available external data, at the level of the individual (think individual behavioral analysis)
 Business Optimization – embeds prescriptive analytics (think recommendations) into existing business processes to optimize select business operations
 Data Monetization – aggregates the insights gathered at the individual level to identify “white spaces” in unmet market and customer demand that can lead to new products, services, markets, channels, partners, audiences, etc.
 Business Metamorphosis – the cultural transformation to data and analytics as the center of the organization, with incentives around the collection, transformation, and sharing of data and analytics, including how employees are hired, paid, promoted, and managed

Assessment: A+. Nailed it! While the phase descriptions have evolved as we have learned more, this is probably my most important contribution to the world of Big Data: the “Big Data Business Model Maturity Index.” Not only does the maturity index help organizations understand where they are today with respect to leveraging the business model potential of big data, but it provides a guide to help them become more effective. Yeah, I finally got one right!! If you are interested in learning more, check out the blogs: Big Data Business Model Maturity; Big Data Business Model Maturity Guide; De-mystifying the Big Data Business Model Maturity Index; Driving Digital Business Transformation.

Landmark #4: Big Data Journey

The purpose of Landmark #4 was to define a process that drives alignment between IT and the Business to deliver actionable, business-relevant outcomes. The steps in the process were:
 Identify the targeted business initiative where big data can provide competitive advantage or business differentiation
 Determine – and envision – how big data can deliver the required analytic insights
 Define the over-arching data strategy (acquisition, transformation, enrichment)
 Build analytic models and insights
 Implement big data infrastructure, technologies, and architectures
 Integrate analytic insights into applications and business processes

Assessment: B. While I think I got the process right (especially starting with the business initiatives, and putting the technology toward the end), I missed the importance of identifying the business stakeholder decisions necessary to support the targeted business initiative. It is the decisions (or use cases, which we define as clusters of decisions around a common subject area) that are the linkage point between the business stakeholders and the data science team. An additional blog that drills down into the role of decisions in delivering business benefits: Big Data Journey: Earning the Trust of the Business.

Landmark #5: Operationalize Big Data

The purpose of Landmark #5 was to define a data science process that supported the continuous development and refinement of data and analytics in operationalizing the organization’s big data capabilities. This process included the following steps:
 Collaborate with the business stakeholders to capture new business requirements
 Acquire, prepare, and enrich the data; acquire new structured and unstructured sources of data from internal and external sources
 Continuously update and refine analytic models; embrace an experimentation approach to ensure on-going model relevance
 Publish analytic insights back into applications and operational and management systems
 Measure decision and business effectiveness in order to continuously fine-tune analytic models, business processes, and applications

Assessment: C-. While again I think I got the process right, recent developments in determining the economic value of data and analytics will greatly enhance the business-critical nature of this process. Data and analytics as digital assets exhibit unique characteristics (i.e., an asset that appreciates, not depreciates, with usage and can be used simultaneously across multiple business use cases) that make them game-changing assets in which to invest. All I can say at this point is “Watch this space” because “you ain’t seen nothing yet!” Blogs that expand on data and analytics operationalization concepts include: Determining the Economic Value of Data; Chief Data Officer Toolkit: Leading the Digital Business Transformation – Part I and Part II.

Landmark #6: Value Creation City

The purpose of Landmark #6 was to provide some examples of the business functions that could benefit from big data, including:
 Procurement, to identify which suppliers are most cost-effective in delivering high-quality products on-time
 Product Development, to identify product usage insights to speed product development and improve new product launches
 Manufacturing, to flag machinery and process variances that might be indicators of quality problems
 Distribution, to quantify optimal inventory levels and supply chain activities
 Marketing, to identify which marketing campaigns are the most effective in driving engagement and sales
 Operations, to optimize prices for “perishable” goods such as groceries, airline seats, and fashion merchandise
 Sales, to optimize account targeting, resource allocation, and revenue forecasting
 Human Resources, to identify the characteristics and behaviors of the most successful and effective employees

Assessment: A. Yea, I felt all along that the real value of big data would only be realized when we got technology out of the way and instead focused on understanding where and how big data could deliver business value and business outcomes. As I like to say, the business is not interested in the 3 V’s of Big Data (Volume, Variety and Velocity) as much as the business is interested in the 4 M’s of Big Data: Make Me More Money! Blogs that go into more detail on the business value aspects of big data include: The “4 Ms” of Big Data; 4 M’s of Big Data: Make Me More Money! infographic.
2. Key-Value
 Focus on scaling to huge amounts of data
 Designed to handle massive load
 Based on Amazon’s Dynamo paper
 Data model: (global) collection of key-value pairs
 Dynamo ring partitioning and replication

BigTable Clones
 Like column-oriented relational databases, but with a twist
 Tables similar to an RDBMS, but handling semi-structured data
 Based on Google’s BigTable paper
 Data model: columns → column families → ACL; datums keyed by row, column, time, index; row-range → tablet → distribution

Document
 Similar to key-value stores, but the DB knows what the value is
 Inspired by Lotus Notes
 Data model: collections of key-value collections
 Documents are often versioned

GraphDB
 Focus on modeling the structure of data – interconnectivity
 Scales to the complexity of the data
 Inspired by mathematical graph theory (G = (E, V))
 Data model: “property graph” with nodes, relationships/edges between nodes (first class), key-value pairs on both, and possibly edge labels and/or node/edge types