ElasticSearch has become an essential component of today's Big Data (Fast Data) architectures, not only as a search engine but above all for the competitive advantage its real-time analytics provide. In this short talk we will look at where ElasticSearch sits within the NoSQL landscape, at examples of Big Data architectures that exploit its characteristics, and at how easily it integrates with tools such as Apache Spark.
2. Alberto Paro
Degree in Computer Engineering (POLIMI)
Author of 3 books on ElasticSearch, covering versions 1.x to 5.x, plus 6 tech reviews
I work mainly in Scala and with Big Data technologies (Akka, Spray.io, Play Framework, Apache Spark) and NoSQL stores (Accumulo, Cassandra, ElasticSearch and MongoDB)
Evangelist for the Scala and Scala.js languages
3. ElasticSearch 5.x - Cookbook
Choose the best cloud topology for deployment and extend it with external plugins
Develop optimized mappings for full control over the indexing steps
Build complex queries through correct management of mappings and documents
Refine results with aggregations
Monitor node and cluster performance
Install Kibana to monitor the cluster and extend it with plugins
Integrate ElasticSearch with Java, Scala, Python and Big Data applications
Ebook discount code: ALPOEB50
Print discount code: ALPOPR15
Valid until: June 21, 2017
5. Transforming Big Data into Value
The 'datafication' of everything: activities, conversations, text, voice, social media, browser logs, photos, video, sensors, etc.
The four V's: Volume, Velocity, Variety, Veracity
Big Data analysis: text analytics, sentiment analysis, face recognition, voice analytics, movement analytics, etc.
The result: Value
9. Hadoop MR vs Apache Spark
An evolution of the Map Reduce model. In Hadoop MapReduce, every iteration reads its input from HDFS and writes its output back to HDFS (on-disk storage, 128 MB blocks), so a chain of iterations pays a disk round trip at each step. Spark keeps intermediate results in RAM (2 MB blocks), so the output of one iteration feeds the next directly, making iterative workloads up to 100 times faster.
10. Apache Spark
Written in Scala, with APIs in Java, Python and R
An evolution of the Map/Reduce model
Functional approach to data processing
Easy integration with all major datastores
Powerful companion modules:
Spark SQL
Spark Streaming
MLlib (machine learning)
GraphX (graphs)
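The functional approach can be illustrated without Spark at all: Scala collections expose the same map/flatMap/reduce style that the RDD API builds on. A minimal word-count sketch (plain collections, no Spark dependency; the input lines are invented for the example):

```scala
// Word count in the functional style mirrored by Spark's RDD API.
// Plain Scala collections, so this runs without any Spark dependency.
val lines = Seq("spark is fast", "spark is functional")

val counts: Map[String, Int] =
  lines
    .flatMap(_.split("\\s+"))          // split each line into words
    .map(word => word -> 1)            // pair every word with a count of 1
    .groupMapReduce(_._1)(_._2)(_ + _) // sum the counts per word (Scala 2.13+)

println(counts)
```

On an actual RDD the same pipeline would read `rdd.flatMap(...).map(...).reduceByKey(_ + _)`, with the work distributed across the cluster.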
13. Writing to Elasticsearch via Spark
Spark initialization:
import org.apache.spark.SparkConf
val conf = new SparkConf().setAppName(appName).setMaster(master)
conf.set("es.index.auto.create", "true")
Writing to Elasticsearch
Every Spark RDD and DStream can be written to Elasticsearch with a single call:
import org.elasticsearch.spark.rdd.EsSpark
EsSpark.saveToEs(records, "geonames/geoname", Map("es.mapping.id" -> "geonameid"))
14. Reading from Elasticsearch via Spark
Reading via query:
...
import org.elasticsearch.spark._
...
val conf = ...
val sc = new SparkContext(conf)
sc.esRDD("radio/artists", "?q=me*")
Spark SQL query:
...
val df = spark.read.format("org.elasticsearch.spark.sql").load("spark/person_id")
...
df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
 |-- username: string (nullable = true)
spark.sql(
  "CREATE TEMPORARY VIEW persons USING org.elasticsearch.spark.sql " +
  "OPTIONS (resource 'spark/person_id', scroll_size '2000')")
val over20 = spark.sql("SELECT * FROM persons WHERE age >= 20")
Landmark #1: Explosive Market Dynamics
The purpose of Landmark #1 was to highlight the market challenges that were necessitating a different approach to integrating big data (data and analytics) into one’s business (we used cute landmarks instead of phases to keep in the spirit of the storymap).
In the original blog, we discussed how organizations that don’t adapt to big data risk the following impacts to their business models:
Profit and margin declines
Market share losses
Competitors innovating faster
Missed business opportunities
We also provided some examples of how organizations could exploit big data to power their businesses, including:
Mine social and mobile data to uncover customers’ interests, passions, associations, and affiliations
Exploit machine data for predictive maintenance and operational optimization
Leverage behavioral insights to create a more compelling user experience
Integrate new big data innovations to modernize data warehouse and business intelligence environments (real-time insights, predictive analytics)
Become a data-driven culture
Nurture and invest in data assets
Cultivate analytic models and insights as intellectual property
Assessment: A+. Yea, I think we got this one right. The business potential is too significant for organizations to ignore, and the Internet of Things (IoT) is only going to make data and analytics more indispensable to the future success of an organization. Also, if I were to redo the storymap, I’d definitely replace the river with a lake.
For more business challenges and opportunities afforded by big data, check out these blogs:
Driving Digital Business Transformation
KPMG Survey: Firms Struggle with Big Data
The Mid-market Big Data Call to Action
Landmark #2: Business and IT Challenges
The purpose of Landmark #2 was to highlight the significant challenges that organizations faced in trying to transform their business intelligence and data warehouse environments to take advantage of the business benefits offered by big data.
The chart highlighted how traditional business intelligence and data warehouse environments are going to struggle to manage and analyze new data sources because of the following challenges:
Rigid data warehouse architectures that impede exploiting immediate business opportunities
Retrospective analysis that reports what happened but doesn't guide business decisions
Social, mobile, or machine insights that are not available in an actionable manner
Batch-oriented processes which delay access to the data for immediate analysis and action
Brittle and labor intensive processes to add new data sources, reports, and analytics
Environments that were performance and scalability challenged as data scales to petabytes
Business analysis limited to aggregated and sampled data views
Analytic environments unable to handle the tsunami of new, external unstructured data sources
Assessment: C. I under-estimated the cultural challenges of moving from Business Intelligence / Data Warehouse to Data Science / Data Lake; the challenge of unlearning old approaches so that one can embrace new ones. I also missed the growing importance of the data lake as more than just a data repository; that the data lake would transform into the organization's collaborative value creation platform that brings Business and IT stakeholders together to exploit the economic value of data and analytics.
For more details on the challenges of transforming from a Business Intelligence to Data Science mentality, check out the below blogs:
Dynamic Duo of Analytic Power: Business Intelligence Analyst PLUS Data Scientist
New Roles in the Big Data World
How I’ve Learned To Stop Worrying And Love The Data Lake
Data Lake Plumbers: Operationalizing the Data Lake
Landmark #3: Big Data Business Transformation
The purpose of Landmark #3 was to provide a benchmark that helped organizations understand how effective they were in leveraging data and analytics to power their business models. The Big Data Business Model Maturity Index introduced 5 stages of measuring how effective organizations are at exploiting the business transformation potential of big data:
Business Monitoring – deploys business intelligence to monitor on-going business performance
Business Insights – leverages predictive analytics to uncover actionable insights buried in the detailed transactional data plus the growing wealth of internal and publicly available external data – at the level of the individual (think individual behavioral analysis)
Business Optimization – embeds prescriptive analytics (think recommendations) into existing business processes to optimize select business operations
Data Monetization – aggregates the insights gathered at the individual level to identify “white spaces” in unmet market and customer demand that can lead to new products, services, markets, channels, partners, audiences, etc.
Business Metamorphosis – the cultural transformation to data and analytics as the center of the organization with incentives around the collection, transformation, and sharing of data and analytics including how employees are hired, paid, promoted, and managed.
Assessment: A+. Nailed it! While the phase descriptions have evolved as we have learned more, this is probably my most important contribution to the world of Big Data – the “Big Data Business Model Maturity Index.” Not only does the maturity index help organizations understand where they are today with respect to leveraging the business model potential of big data, but it provides a guide to help them become more effective. Yeah, I finally got one right!!
If you are interested in learning more about the “Big Data Business Model Maturity Index,” check out these blogs:
Big Data Business Model Maturity
Big Data Business Model Maturity Guide
De-mystifying the Big Data Business Model Maturity Index
Driving Digital Business Transformation
Landmark #4: Big Data Journey
The purpose of Landmark #4 was to define a process that drives alignment between IT and the Business to deliver actionable, business relevant outcomes. The steps in the process were:
Identify the targeted business initiative where big data can provide competitive advantage or business differentiation
Determine – and envision – how big data can deliver the required analytic insights
Define over-arching data strategy (acquisition, transformation, enrichment)
Build analytic models and insights
Implement big data infrastructure, technologies, and architectures
Integrate analytic insights into applications and business processes
Assessment: B. While I think I got the process right (especially starting with the Business Initiatives, and putting the technology toward the end), I missed on the importance of identifying the business stakeholder decisions necessary to support the targeted business initiative. It is the decisions (or use cases, which we define as clusters of decisions around a common subject area) that are the linkage point between the business stakeholders and the data science team.
Here is an additional blog that further drills down into the importance of the role of decisions in delivering business benefits:
Big Data Journey: Earning the Trust of the Business
Landmark #5: Operationalize Big Data
The purpose of Landmark #5 was to define a data science process that supported the continuous development and refinement of data and analytics in operationalizing the organization’s big data capabilities. This process included the following steps:
Collaborate with the business stakeholders to capture new business requirements
Acquire, prepare, and enrich the data; acquire new structured and unstructured sources of data from internal and external sources
Continuously update and refine analytic models; embrace an experimentation approach to ensure on-going model relevance
Publish analytic insights back into applications and operational and management systems
Measure decision and business effectiveness in order to continuously fine-tune analytic models, business processes, and applications
Assessment: C-. While again I think I got the process right, recent developments in determining the economic value of data and analytics will greatly enhance the business critical nature of this process. Data and analytics as digital assets exhibit unique characteristics (i.e., an asset that appreciates, not depreciates, with usage and can be used simultaneously across multiple business use cases) to make them game-changing assets in which to invest. All I can say at this point is “Watch this space” because “you ain’t seen nothing yet!”
Blogs that expand on data and analytics operationalization concepts include:
Determining the Economic Value of Data
Chief Data Officer Toolkit: Leading the Digital Business Transformation – Part I
Chief Data Officer Toolkit: Leading the Digital Business Transformation – Part II
Landmark #6: Value Creation City
The purpose of Landmark #6 was to provide some examples of the business functions that could benefit from big data including:
Procurement to identify which suppliers are most cost-effective in delivering high-quality products on-time
Product Development to identify product usage insights to speed product development and improve new product launches
Manufacturing to flag machinery and process variances that might be indicators of quality problems
Distribution to quantify optimal inventory levels and supply chain activities
Marketing to identify which marketing campaigns are the most effective in driving engagement and sales
Operations to optimize prices for “perishable” goods such as groceries, airline seats, and fashion merchandise
Sales to optimize account targeting, resource allocation, and revenue forecasting
Human Resources to identify the characteristics and behaviors of the most successful and effective employees
Assessment: A. Yea, I felt all along that the real value of big data would only be realized when we got technology out of the way and instead focused on understanding where and how big data could deliver business value and business outcomes. As I like to say, the business is not interested in the 3 V’s of Big Data (Volume, Variety and Velocity) as much as the business is interested in the 4 M’s of Big Data: Make Me More Money!
Blogs that go into more details on the business value aspects of big data include:
The “4 Ms” of Big Data
4 M’s of Big Data: Make Me More Money! infographic
Key-Value:
Focus on scaling to huge amounts of data
Designed to handle massive load
Based on Amazon's Dynamo paper
Data model: a (global) collection of key-value pairs
Dynamo ring partitioning and replication
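As a toy illustration of the model (all names invented for this sketch; a real store adds replication, consistent hashing, and persistence), a key-value collection partitioned by key hash can be reduced to:

```scala
// Toy key-value store with Dynamo-style hash partitioning:
// a key's partition is a pure function of its hash (its ring position).
class PartitionedKV(numPartitions: Int) {
  private val partitions =
    Array.fill(numPartitions)(scala.collection.mutable.Map.empty[String, String])

  private def partitionFor(key: String): Int =
    Math.floorMod(key.hashCode, numPartitions) // floorMod avoids negative indices

  def put(key: String, value: String): Unit =
    partitions(partitionFor(key)).update(key, value)

  def get(key: String): Option[String] =
    partitions(partitionFor(key)).get(key)
}

val kv = new PartitionedKV(4)
kv.put("user:1", "alice")
kv.put("user:2", "bob")
```

Because routing depends only on the key's hash, any node can compute where a key lives without a central directory, which is what lets these stores scale horizontally.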
Big Table Clones:
Like column-oriented relational databases, but with a twist
Tables similar to an RDBMS, but handling semi-structured data
Based on Google's BigTable paper
Data model:
Columns → column families → ACL
Datums keyed by: row, column, time, index
Row-range → tablet → distribution
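The keying scheme above (datums keyed by row, column, and time) can be sketched as a sorted map of cells; this is illustrative only, not any real store's API:

```scala
import scala.collection.mutable

// Toy BigTable-style cell store: a value is addressed by
// (row, "family:qualifier", timestamp); multiple timestamps = versions.
val cells = mutable.SortedMap.empty[(String, String, Long), String]

def write(row: String, column: String, ts: Long, value: String): Unit =
  cells.update((row, column, ts), value)

// Reading a cell returns its most recent version (highest timestamp),
// relying on the sorted iteration order of the map.
def read(row: String, column: String): Option[String] =
  cells.collect { case ((r, c, _), v) if r == row && c == column => v }.lastOption

write("row1", "cf:name", 1L, "old")
write("row1", "cf:name", 2L, "new")
```

Sorting by (row, column, time) is also what makes row-range splits into tablets natural: contiguous key ranges map to contiguous chunks of the table.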
Document:
Similar to key-value stores, but the DB knows what the value is
Inspired by Lotus Notes
Data model: collections of key-value collections
Documents are often versioned
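A document store's two defining traits, that the database understands the structure of the value and that documents keep versions, can be sketched as (hypothetical names, not a real driver API):

```scala
import scala.collection.mutable

// Toy document store: values are structured documents the store can inspect,
// and every update keeps the previous versions (newest first).
val store = mutable.Map.empty[String, List[Map[String, Any]]]

def save(id: String, doc: Map[String, Any]): Unit =
  store.update(id, doc :: store.getOrElse(id, Nil))

def latest(id: String): Option[Map[String, Any]] =
  store.get(id).flatMap(_.headOption)

// Because the store "knows what the value is", it can query inside documents.
def findBy(field: String, value: Any): List[String] =
  store.collect {
    case (id, docs) if docs.headOption.exists(_.get(field).contains(value)) => id
  }.toList

save("u1", Map("name" -> "alice", "age" -> 30))
save("u1", Map("name" -> "alice", "age" -> 31)) // second version of u1
```

The `findBy` query is exactly what a plain key-value store cannot offer: there the value is an opaque blob, here it is a key-value collection the engine can index and filter on.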
GraphDB:
Focus on modeling the structure of data: interconnectivity
Scales to the complexity of the data
Inspired by mathematical graph theory ( G = (V, E) )
Data model: "property graph"
Nodes
Relationships/edges between nodes (first class)
Key-value pairs on both
Possibly edge labels and/or node/edge types
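A minimal sketch of the property-graph data model described above (hypothetical names, no graph-database library assumed):

```scala
// Toy property graph: nodes and edges both carry key-value properties,
// and relationships are first-class values typed by a label.
case class Node(id: Int, props: Map[String, String])
case class Edge(from: Int, to: Int, label: String, props: Map[String, String])

val nodes = Map(
  1 -> Node(1, Map("name" -> "alice")),
  2 -> Node(2, Map("name" -> "bob"))
)
val edges = List(Edge(1, 2, "KNOWS", Map("since" -> "2015")))

// A basic traversal: follow edges with a given label out of a node.
def neighbors(id: Int, label: String): List[Node] =
  edges.collect { case Edge(`id`, to, `label`, _) => nodes(to) }
```

Traversals like `neighbors` are the core operation graph databases optimize for; note that the edge itself holds data ("since" → "2015"), which is what "relationships as first class" means in practice.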