Presentation from the EsInRome event of 7 February 2017: integrating Elasticsearch into a Big Data architecture, and how easily it integrates with Apache Spark.
2017 02-07 - elastic & spark. building a search geo locator
1. Rome – 7 February 2017
presented by Alberto Paro, Seacom
Elastic & Spark.
Building A Search Geo Locator
2. Alberto Paro
Degree in Computer Engineering (POLIMI)
Author of 3 books on ElasticSearch, covering versions 1 to 5.x, plus 6 tech reviews
I work mainly in Scala and on Big Data technologies (Akka, Spray.io, Playframework, Apache Spark) and NoSQL (Accumulo, Cassandra, ElasticSearch and MongoDB)
Evangelist for the Scala language and Scala.JS
3. Elasticsearch 5.x - Cookbook
Choose the best ElasticSearch cloud topology to deploy and power it up with external plugins
Develop tailored mapping to take full control of index steps
Build complex queries through managing indices and documents
Optimize search results through executing analytics aggregations
Monitor the performance of the cluster and nodes
Install Kibana to monitor cluster and extend Kibana for plugins.
Integrate ElasticSearch in Java, Scala, Python and Big Data applications
Discount code for Ebook: ALPOEB50
Discount code for Print Book: ALPOPR15
Expiration Date: 21st Feb 2017
4. Objectives
Big Data architectures with ES
Apache Spark
GeoIngester
Data Collection
Index optimization
Ingestion via Apache Spark
Searching for a place
Overview of Big Data tools
6. Hadoop / Spark
[Diagram: Hadoop MapReduce (Input → Iter 1 → Iter 2, with an HDFS write and an HDFS read around each iteration) compared with Apache Spark (Input → Iter 1 → Iter 2)]
Evolution of the MapReduce model
7. Apache Spark
Written in Scala, with APIs in Java, Python and R
Evolution of the Map/Reduce model
Powerful companion modules:
Spark SQL
Spark Streaming
MLlib (Machine Learning)
GraphX (graph)
8. Geoname
GeoNames is a geographical database, freely downloadable under a Creative Commons license.
It contains about 10 million geographical names and consists of about 9 million unique features, of which 2.8 million are populated places and 5.5 million are alternate names.
It can easily be downloaded from http://download.geonames.org/export/dump as a CSV file.
The code is available at:
https://github.com/aparo/elasticsearch-geonames-locator
9. Geoname - Structure
No.  Attribute name   Explanation
1    geonameid        Unique ID for this geoname
2    name             The name of the geoname
3    asciiname        ASCII representation of the name
4    alternatenames   Other forms of this name, generally in several languages
5    latitude         Latitude in decimal degrees of the geoname
6    longitude        Longitude in decimal degrees of the geoname
7    fclass           Feature class, see http://www.geonames.org/export/codes.html
8    fcode            Feature code, see http://www.geonames.org/export/codes.html
9    country          ISO-3166 2-letter country code
10   cc2              Alternate country codes, comma separated, ISO-3166 2-letter country code
11   admin1           FIPS code (subject to change to ISO code)
12   admin2           Code for the second administrative division, a county in the US
13   admin3           Code for the third-level administrative division
14   admin4           Code for the fourth-level administrative division
15   population       The population of the geoname
16   elevation        The elevation in meters of the geoname
17   gtopo30          Digital elevation model
18   timezone         The timezone of the geoname
19   moddate          The date of the last change to this geoname
10. Index optimization – 1/2
Needed to:
Remove fields that are not required.
Handle Geo Point fields.
Optimize string fields (text, keyword)
Set the correct number of shards (11M records => 2 shards)
Benefits => performance/space/CPU
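The points above could translate into an index definition along these lines. This is only a sketch, assuming the Elasticsearch 5.x mapping layout with a mapping type named geoname; the selection of fields, their types and the name.raw sub-field are illustrative assumptions, not the mapping used in the talk.

```scala
// Sketch only: builds the JSON body one might send when creating the index.
// Field selection, field types and the "raw" sub-field are assumptions.
object GeonameIndexSettings {
  // 2 shards, as suggested on the slide for ~11M records
  val numberOfShards: Int = 2

  val body: String =
    s"""{
       |  "settings": { "number_of_shards": $numberOfShards },
       |  "mappings": {
       |    "geoname": {
       |      "properties": {
       |        "geonameid":  { "type": "integer" },
       |        "name":       { "type": "text", "fields": { "raw": { "type": "keyword" } } },
       |        "country":    { "type": "keyword" },
       |        "fclass":     { "type": "keyword" },
       |        "fcode":      { "type": "keyword" },
       |        "location":   { "type": "geo_point" },
       |        "population": { "type": "double" }
       |      }
       |    }
       |  }
       |}""".stripMargin

  def main(args: Array[String]): Unit = println(body)
}
```

Full-text fields use text, exact-match codes use keyword, and the coordinates are stored once as a geo_point so that distance queries can be run against them.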
12. Ingestion via Spark – GeonameIngester – 1/7
Our ingester will perform the following steps:
Initializing the Spark Job
Parsing the CSV
Defining the indexing structure
Populating the classes
Writing the data to Elasticsearch
Running the Spark Job
13. Ingestion via Spark – GeonameIngester – 2/7
Initializing a Spark Job
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.elasticsearch.spark.rdd.EsSpark
import scala.util.Try

object GeonameIngester {
  def main(args: Array[String]): Unit = {
    // Local, single-machine session for the ingestion job
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("GeonameIngester")
      .getOrCreate()
14. Ingestion via Spark – GeonameIngester – 3/7
Parsing the CSV
    // Schema of the GeoNames dump (the remaining fields are elided on the slide)
    val geonameSchema = StructType(Array(
      StructField("geonameid", IntegerType, false),
      StructField("name", StringType, false),
      StructField("asciiname", StringType, true),
      StructField("alternatenames", StringType, true),
      StructField("latitude", FloatType, true), ….
    val GEONAME_PATH = "downloads/allCountries.txt"
    // The dump is tab-separated, with no header row and no quoting
    val geonames = sparkSession.sqlContext.read
      .option("header", false)
      .option("quote", "")
      .option("delimiter", "\t")
      .option("maxColumns", 22)
      .schema(geonameSchema)
      .csv(GEONAME_PATH)
      .cache()
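To make the tab-separated layout concrete without a Spark cluster, here is a minimal plain-Scala sketch of splitting one line of the dump. The object name TsvLine and the sample values are hypothetical; the column count of 19 comes from the attribute table on slide 9.

```scala
// Sketch only: splits one tab-separated line and checks the column count.
object TsvLine {
  // 19 attributes, from geonameid to moddate
  val ExpectedColumns = 19

  def splitFields(line: String): Either[String, Array[String]] = {
    // limit -1 keeps trailing empty fields, which String.split drops by default
    val fields = line.split("\t", -1)
    if (fields.length == ExpectedColumns) Right(fields)
    else Left(s"expected $ExpectedColumns columns, found ${fields.length}")
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample row: six filled columns followed by thirteen empty ones
    val sample =
      (Seq("1", "Sample", "Sample", "Alt1,Alt2", "0.0", "0.0") ++ Seq.fill(13)("")).mkString("\t")
    println(splitFields(sample).map(_.length))
    println(splitFields("too\tshort"))
  }
}
```

Many rows leave the trailing columns empty, which is why the explicit split limit matters when parsing by hand; the Spark CSV reader handles this case itself.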
15. Ingestion via Spark – GeonameIngester – 4/7
Defining our classes for indexing
import scala.language.implicitConversions

case class GeoPoint(lat: Double, lon: Double)
case class Geoname(geonameid: Int, name: String, asciiname: String, alternatenames: List[String],
  latitude: Float, longitude: Float, location: GeoPoint, fclass: String, fcode: String, country: String,
  cc2: String, admin1: Option[String], admin2: Option[String], admin3: Option[String],
  admin4: Option[String], population: Double, elevation: Int, gtopo30: Int, timezone: String,
  moddate: String)

// Implicitly converts null or blank strings to None and trims everything else
implicit def emptyToOption(value: String): Option[String] =
  Option(value).map(_.trim).filter(_.nonEmpty)
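The effect of the implicit conversion can be tried in isolation. The sketch below is an illustration only: the Admin case class and the sample values are hypothetical, and emptyToOption is written without the early return but is meant to behave the same way as the one on the slide.

```scala
import scala.language.implicitConversions

// Sketch only: shows plain Strings being accepted where Option[String] is expected.
object EmptyToOptionDemo {
  // null or blank strings become None, everything else is trimmed
  implicit def emptyToOption(value: String): Option[String] =
    Option(value).map(_.trim).filter(_.nonEmpty)

  // Hypothetical two-field stand-in for the admin1..admin4 fields of Geoname
  case class Admin(admin1: Option[String], admin2: Option[String])

  def main(args: Array[String]): Unit = {
    val missing: String = null
    // Both arguments are Strings; the compiler inserts emptyToOption
    println(Admin("  07 ", missing))
    println(Admin("RM", "   "))
  }
}
```

This is what lets the row-population code pass row.getString(...) directly for the optional admin fields.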
16. Ingestion via Spark – GeonameIngester – 5/7
17. Ingestion via Spark – GeonameIngester – 6/7
Populating our classes
    import sparkSession.implicits._ // encoders required by Dataset.map
    val records = geonames.map { row =>
      val id = row.getInt(0)
      val lat = row.getFloat(4)
      val lon = row.getFloat(5)
      Geoname(id, row.getString(1), row.getString(2),
        // split the comma-separated alternate names, dropping blank entries
        Option(row.getString(3)).map(_.split(",").map(_.trim).filterNot(_.isEmpty).toList).getOrElse(Nil),
        lat, lon, GeoPoint(lat, lon),
        row.getString(6), row.getString(7), row.getString(8), row.getString(9),
        // admin1..admin4 become Option[String] through emptyToOption
        row.getString(10), row.getString(11), row.getString(12), row.getString(13),
        // fixNullInt is not defined on the slides
        row.getDouble(14), fixNullInt(row.get(15)), row.getInt(16), row.getString(17),
        row.getDate(18).toString
      )
    }
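The code above calls fixNullInt, which the slides never define. A minimal sketch of what such a helper might look like follows; the default of 0 for missing or unparsable values and the string fallback are assumptions, not the author's implementation.

```scala
import scala.util.Try

// Sketch only: a hypothetical null-safe Int extractor for the nullable elevation column.
object NullSafe {
  def fixNullInt(value: Any): Int = value match {
    case null   => 0 // assumption: a missing elevation is stored as 0
    case i: Int => i
    // fall back to parsing the textual form, defaulting to 0 on failure
    case other  => Try(other.toString.trim.toInt).getOrElse(0)
  }

  def main(args: Array[String]): Unit = {
    println(fixNullInt(null))
    println(fixNullInt(21))
    println(fixNullInt(" 37 "))
  }
}
```

A helper of this kind is needed because row.getInt fails on a null value, and the elevation column is empty for many rows.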
18. Ingestion via Spark – GeonameIngester – 7/7
Writing to Elasticsearch
    // Use geonameid as the document ID in the geonames/geoname index/type
    EsSpark.saveToEs(records.toJavaRDD, "geonames/geoname", Map("es.mapping.id" -> "geonameid"))
Running a Spark Job
spark-submit --class GeonameIngester target/scala-2.11/elasticsearch-geonames-locator-assembly-1.0.jar
(~20 minutes on a single machine)
Key Value:
Focus on scaling to huge amounts of data
Designed to handle massive load
Based on Amazon’s Dynamo paper
Data model: (global) collection of Key-Value pairs
Dynamo ring partitioning and replication
Big Table Clones
Like column oriented Relational Databases, but with a twist
Tables similarly to RDBMS, but handles semi-structured
Based on Google’s BigTable paper
Data model:
‣ Columns → column families → ACL
‣ Datums keyed by: row, column, time, index
‣ Row-range → tablet → distribution
Document
Similar to Key-Value stores, but the DB knows what the Value is
Inspired by Lotus Notes
Data model: Collections of Key-Value collections
Documents are often versioned
GraphDB
Focus on modeling the structure of data – interconnectivity
Scales to the complexity of the data
Inspired by mathematical Graph Theory ( G=(E,V) )
Data model: “Property Graph”
‣ Nodes
‣ Relationships/Edges between Nodes (first class)
‣ Key-Value pairs on both
‣ Possibly Edge Labels and/or Node/Edge Types