This document describes how TrustYou processes large amounts of hotel review data to provide summaries to travelers. Millions of new reviews are crawled every week in more than 25 languages. Natural language processing and machine learning techniques are used to analyze the text and provide recommendations. Workflows are managed through Luigi, and tasks include crawling, text processing, modeling word embeddings, and powering a sample application. Hadoop and Python are used extensively to handle the large-scale processing.
Capture - Day 1 - 09:00 - "Drawing a Line Under the Measurement of Video Adve..." (PerformanceIN)
The buying and selling of video advertising is not new. More recent developments are the rise of the all-consuming publisher and the number of brands adopting video to target consumers across different channels.
Then comes the tricky part: the measurement. In measuring the 'performance' of online video, what metrics should we be using to understand its impact?
In this session, Greg Smith, Head of International and Programmatic at Tremor Video, will guide PMI attendees through the potential of video advertising and what advertisers, publishers and agencies should really be looking for when grading success.
Helping travelers make better hotel choices - 500 million times a month
TrustYou analyzes online hotel reviews to create a summary for every hotel in the world. What do travelers think of the service? Is this hotel suitable for business travelers? TrustYou data is integrated on countless websites (Trivago, Wego, Kayak), helping travelers make better choices. Try it out yourself on http://www.trust-score.com/
TrustYou runs almost exclusively on Python. Every week, we find 3 million new hotel reviews on the web, process them, analyze the text using Natural Language Processing, and update our database of 600,000 hotels. In this talk, Steffen will give insights into how Python is used at TrustYou to collect, analyze and visualize these large amounts of data.
Introduction to Apache Flink - Fast and reliable big data processing (Till Rohrmann)
This presentation introduces Apache Flink, a massively parallel data processing engine that is currently undergoing incubation at the Apache Software Foundation. Flink's programming primitives are presented, and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming and automatic optimisation make it a unique system in the world of Big Data processing.
Software Architecture: Principles, Patterns and Practices (Ganesh Samarthyam)
Are you a developer or designer aspiring to become an architect? Do you want to learn about the architecture of open source applications? Do you want to learn software architecture through case studies and examples? If you have answered “yes” to any of these questions, this presentation is certainly for you. This presentation will introduce you to key topics in software architecture including architectural principles, constraints, non-functional requirements (NFRs), architectural styles and design patterns, viewpoints and perspectives, and architecture tools. A special feature of this workshop: it covers examples and case studies from open source applications. What’s more, you’ll also get exposed to some free or open source tools used by practicing software architects.
Contents overview:
* Introduction to SA
* Overview of design principles, patterns and architectural styles
* Realising quality requirements (NFRs)
* Case studies: Architecture of well-known open source applications
* Tools: Free or open source tools for software architects
* Must-read books on software architecture
(Presented at the OSI Days workshop in Bangalore on 19th Nov 2015).
NoSQL - MongoDB. Agility, scalability, performance. I am going to talk about the basics of NoSQL and MongoDB. Why do some projects require RDBMSs and others NoSQL databases? What are the pros and cons of NoSQL vs. SQL? How is data stored and transferred in MongoDB? What query language is used? How does MongoDB support high availability and automatic failover with the help of replication? What is sharding and how does it help to support scalability? Also covered is the newest level of concurrency control: collection-level and document-level locking.
Sorry - How Bieber broke Google Cloud at Spotify (Neville Li)
Talk at Scala Up North Jul 21 2017
We will talk about Spotify's story with Scala big data, our journey to migrate our entire data infrastructure to Google Cloud, and how Justin Bieber contributed to breaking it. We'll talk about Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and the technology behind it, including macros, algebird, chill and shapeless. There'll also be a live coding demo.
Relational databases were created a long time ago for a simpler world. Even if they are still awesome tools for generic workloads, there are some things they cannot do well.
In this session I will speak about purpose-built databases that you can use for specific business scenarios. We will see the type of queries you can run on a Graph database, a Document Database, and a Time-Series database. We will then see how a relational database could also be used for the same use cases, just in a much more complex way.
Rdio's Alex Gaynor at Heroku's Waza 2013: Why Python, Ruby and Javascript are... (Heroku)
Rdio Software Engineer Alex Gaynor (@alex_gaynor) took to the #Waza 2013 stage (Heroku's Developer Conference) to talk about "Why Python, Ruby and Javascript are Slow". Gaynor argues that developers should aim to make performance beautiful. For more from Gaynor or to contact him, ping him at @Alex_Gaynor.
For more on Waza visit http://waza.heroku.com/2013.
For Waza videos stay tuned at http://blog.heroku.com or visit http://vimeo.com/herokuwaza
Presentation of the Semantic Knowledge Graph research paper at the 2016 IEEE 3rd International Conference on Data Science and Advanced Analytics (Montreal, Canada - October 18th, 2016)
Abstract—This paper describes a new kind of knowledge representation and mining system which we are calling the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data, new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations), and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendations systems. The main contribution of this paper is the introduction of a novel system - the Semantic Knowledge Graph - which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) through dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain.
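As a rough illustration of the mechanism described in the abstract, the sketch below materializes an edge between two terms from their intersecting postings lists and scores it with a simple overlap ratio; the postings data and the scoring formula are made up for illustration and are not the paper's actual corpus statistics.

# Hypothetical postings lists: term -> set of document IDs (an inverted index).
postings = {
    "wifi": {1, 2, 5, 8, 9},
    "slow": {2, 5, 9, 12},
    "breakfast": {3, 5, 7},
}

def edge_score(term_a, term_b, postings):
    """Materialize the edge between two nodes (terms) as the set of documents
    in which both occur, scored here by a simple overlap ratio."""
    docs_a, docs_b = postings[term_a], postings[term_b]
    shared = docs_a & docs_b
    if not shared:
        return 0.0
    return len(shared) / min(len(docs_a), len(docs_b))

print(edge_score("wifi", "slow", postings))       # stronger relationship
print(edge_score("wifi", "breakfast", postings))  # weaker relationship

Because the edges are computed on demand from the index, any combination of terms can be connected and scored without storing an explicit graph, which is the key point of the abstract.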
Get started with Lua - Hackference 2016 (Etiene Dalcol)
Lua is a very fast, elegant and powerful dynamic language. It’s an excellent tool for robust applications or slim embedded systems. It found a niche in game development with big names such as “Grim Fandango”, “World of Warcraft” and “Angry Birds”. This talk will present what makes Lua different from other interpreted languages, the evolution of the Lua ecosystem, some key concepts of the language, and show you why Lua is the next language to add to your skill set.
OCF.tw's talk about "Introduction to Spark" (Giivee The)
A talk about Spark given at the invitation of OCF and OSSF.
If you are interested in the Open Culture Foundation (OCF) or the Open Source Software Foundry (OSSF),
please check http://ocf.tw/ or http://www.openfoundry.org/
Thanks also to CLBC for providing the venue.
If you would like to work in a great working environment,
feel free to contact CLBC: http://clbc.tw/
Migrate module tour by Moshe Weitzman of Acquia. Presented at Drupalcon London 2011. See http://london2011.drupal.org/conference/sessions/data-migration-drupal
Reinforcement Learning (RL) approaches deal with finding an optimal reward-based policy for acting in an environment. (Talk in English)
However, what has led to their widespread use is their combination with deep neural networks (DNNs), i.e., deep reinforcement learning (Deep RL). Recent successes in not only learning to play games but also surpassing humans at them, together with academia-industry research collaborations on manipulation of objects, locomotion skills, smart grids and more, have demonstrated their value on a wide variety of challenging tasks.
With applications spanning games, robotics, dialogue, healthcare, marketing, energy and many more domains, Deep RL might just be the power that drives the next generation of Artificial Intelligence (AI) agents!
This talk asks about the role of Big Data in Smart Cities and the construction of the city of the future. Thanks to the development of fields such as Data Science, the Internet of Things and Urban Analytics, new ways of understanding urban dynamics and environments are emerging.
"Naturally Intelligent Environments" are the vision of a future city as a living, complex organism that adapts, transforms and reinvents itself; this process is a constant search for new, more sustainable ways of coexisting with other systems.
We are at a fascinating moment in healthcare. Today it is possible to obtain very timely clinical diagnoses and generate predictions in real time, which opens up opportunities that will have a very positive impact on society. One of these is precision medicine, which seeks to exploit insights about biological conditions, environment and habits to preventively improve the health of individuals.
The time has come... the predictions of the future are happening now, and in Colombia the first steps are already being taken!
Deep learning: the renaissance of neural networks (Big Data Colombia)
Deep learning has revolutionized the landscape of machine learning in particular and artificial intelligence in general. Deep neural network models (with a large number of layers) have enabled important advances in a variety of learning, perception and data analysis tasks, ranging from image classification to speech recognition.
The talk will present, in general terms, the foundations of these models and different application cases in representation learning, computer vision and text analysis, among others. The theoretical and technological advances that have made it possible to tackle these complex problems will be reviewed, and the technological and scientific experience from research projects carried out in Colombia will be discussed.
Presenter: Fabio Gonzalez. Full Professor in the Department of Systems and Industrial Engineering at the Universidad Nacional de Colombia, where he leads the machine learning, perception and automatic discovery laboratory (MindLab). His research focuses on machine learning, information retrieval and computer vision, with applications in fields as diverse as medical image analysis, automatic text analysis and learning from multimodal information.
A study reported by the Harvard Business Review describes three strategies for fully exploiting Big Data and Analytics capabilities in an organization: 1) identify, combine and manage multiple data sources; 2) build advanced analytical models to predict and optimize outcomes; 3) transform the organization's capabilities so that the data used and its analysis lead to better decisions. The Cloud computing model supports each of these capabilities.
https://www.youtube.com/watch?v=eXtWRkfMisM
This talk will introduce Machine Learning concepts using kaggle.com (the largest Data Scientist portal in the world). The talk is divided into:
1. Introduction to kaggle.com
2. Machine Learning competitions
3. Kaggle.com as a hiring/job-search site
4. How to compete and obtain good results in ML competitions
5. Practical examples from past competitions
https://www.youtube.com/watch?v=eXtWRkfMisM
During 2012, credit card fraud reached 11.3 billion dollars, an increase of almost 15% compared to 2011, which shows the problem that fraud represents not only for financial institutions but also for society. Traditionally, fraud prevention consisted of physically protecting the infrastructure; however, with ever more payment methods and channels, financial information has become increasingly susceptible to theft. The next option for preventing and controlling fraud is to determine whether a transaction is being carried out by the customer, according to their historical behavior patterns. This is the focus of Fraud Analytics.
This presentation will show how, through Fraud Analytics, it is possible to determine the probability that a transaction was or was not made by the customer, using customers' purchase information, their interactions with the financial institution, and social network analysis. Additionally, the results of commonly used decision rules and of advanced Machine Learning models will be discussed and compared.
Performing data analysis when large amounts of information have to be cross-referenced, processed and cleaned is a difficult and laborious challenge. Apache Spark is a framework for processing large amounts of information.
Introduction to data warehouses: what they are and what they are for. Methodologies for the design and construction of a data warehouse, ETL processes and technology integration.
The world of Big Data and Data Science is highly technical, but understanding its central ideas does not require superpowers. We will explain what this fascinating technological trend consists of, along with its main concepts, tools and possibilities.
Wearables to measure the progress of a disease? Granting loans using social network information? Using an algorithm to find the ideal partner? These are some of the things that Big Data Analytics is making possible today. We will look at these and other examples of startups that are changing the rules of the game.
Data and more data!! ... Every day humanity generates information everywhere; knowing how to group and process it is the essence of the movement driven by Big Data... but the impact on business is something few people talk about. This will be the focus of our talk: learn how, by integrating Big Data into the decision-making process, companies gain competitive advantages from this universe of information.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
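A rough Python sketch of one of these ideas, skipping vertices whose rank has already stopped changing between iterations; the toy graph, damping factor and tolerance below are illustrative and not taken from the notes.

# Minimal power-iteration PageRank that skips already-converged vertices.
def pagerank(graph, damping=0.85, tol=1e-6, max_iter=100):
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    converged = set()
    out_deg = {v: len(nbrs) for v, nbrs in graph.items()}
    # Build reverse adjacency: which vertices link *to* each vertex.
    in_links = {v: [] for v in graph}
    for v, nbrs in graph.items():
        for u in nbrs:
            in_links[u].append(v)
    for _ in range(max_iter):
        new_rank = dict(rank)
        for v in graph:
            if v in converged:          # skip vertices that stopped changing
                continue
            s = sum(rank[u] / out_deg[u] for u in in_links[v])
            new_rank[v] = (1 - damping) / n + damping * s
            if abs(new_rank[v] - rank[v]) < tol:
                converged.add(v)
        rank = new_rank
        if len(converged) == n:
            break
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))

In practice this check would be combined with the other optimizations described above, and a real implementation would also have to handle dangling nodes.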
Adjusting OpenMP PageRank: SHORT REPORT / NOTES (Subhajit Sahu)
For massive graphs that fit in RAM but not in GPU memory, it is possible to take advantage of a shared-memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
Helping Travelers Using 500 Million Hotel Reviews a Month
1. Helping Travellers Make Better Hotel Choices - 500 Million Times a Month
Miguel Cabrera
@mfcabrera
https://www.flickr.com/photos/18694857@N00/5614701858/
3. • Neuberliner
• Systems and Informatics Engineering, Universidad Nacional - Med
• M.Sc. in Informatics, TUM; Hons. Technology Management
• Work for TrustYou as Data (Scientist|Engineer|Juggler)™
• Founder and former organizer of Munich DataGeeks
ABOUT ME
35. • Build your own web crawlers
• Extract data via CSS selectors, XPath, regexes, etc.
• Handles queuing, request parallelism, cookies, throttling …
• Comprehensive and well-designed
• Commercial support by http://scrapinghub.com/
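The slide refers to the Scrapy framework; a minimal spider along these lines might look like the following sketch (the start URL and CSS selectors are hypothetical, not TrustYou's actual crawler).

import scrapy

class HotelReviewSpider(scrapy.Spider):
    name = "hotel_reviews"
    # Hypothetical start URL; a real crawler would target actual review pages.
    start_urls = ["https://example.com/hotels/reviews"]

    def parse(self, response):
        # Extract one item per review block using CSS selectors.
        for review in response.css("div.review"):
            yield {
                "title": review.css("h3.title::text").get(),
                "text": review.css("p.body::text").get(),
            }
        # Follow pagination links; Scrapy handles queuing and throttling.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run with the scrapy CLI (e.g. "scrapy runspider spider.py -o reviews.json") and Scrapy takes care of request scheduling, parallelism and cookies as listed above.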
36.
37.
38.
39.
40. • 2 - 3 million new reviews/week
• Customers want alerts 8 - 24h after review publication!
• Smart crawl frequency & depth, but still high overhead
• Pools of constantly refreshed EC2 proxy IPs
• Direct API connections with many sites
Crawling at TrustYou
41. • Custom framework very similar to scrapy
• Runs on Hadoop cluster (100 nodes)
• Not 100% suitable for MapReduce
• Nodes mostly waiting
• Coordination/messaging between nodes required:
– Distributed queue
– Rate limiting (see the sketch below)
Crawling at TrustYou
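The deck does not show how the rate limiting works; as a purely illustrative sketch, a per-domain token-bucket limiter that each crawler node could consult before issuing a request might look like this (rates and the domain name are made up).

import time
from collections import defaultdict

class TokenBucket:
    """Allow roughly `rate` requests per second per domain (illustrative only)."""
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = defaultdict(lambda: capacity)
        self.updated = defaultdict(time.monotonic)

    def allow(self, domain):
        now = time.monotonic()
        elapsed = now - self.updated[domain]
        self.updated[domain] = now
        # Refill tokens according to elapsed time, capped at capacity.
        self.tokens[domain] = min(self.capacity, self.tokens[domain] + elapsed * self.rate)
        if self.tokens[domain] >= 1:
            self.tokens[domain] -= 1
            return True
        return False

limiter = TokenBucket(rate=2, capacity=5)   # at most ~2 requests/second per domain
if limiter.allow("example.com"):
    pass  # fetch the page; otherwise re-queue the request for later

In a distributed setup like the one described on the slide, the bucket state would have to live in shared storage (for example alongside the distributed queue) rather than in the memory of a single process.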
45. • “great rooms” → JJ NN
• “great hotel” → JJ NN
• “rooms are terrible” → NN VB JJ
• “hotel is terrible” → NN VB JJ
Text Processing
>>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))
[('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
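A rough sketch of how part-of-speech patterns like the ones above could be used to pull opinion phrases out of review sentences; the patterns and matching logic are illustrative and are not TrustYou's actual pipeline (NLTK also needs its tokenizer and tagger models downloaded first).

import nltk

# One-time downloads of the tokenizer and tagger models (illustrative setup).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Simple tag patterns for opinion phrases, e.g. adjective+noun ("great rooms")
# or noun+verb+adjective ("hotel is terrible"); VBZ/VBP are specific verb tags.
PATTERNS = [("JJ", "NN"), ("NN", "VBZ", "JJ"), ("NNS", "VBP", "JJ")]

def opinion_phrases(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    words = [w for w, _ in tagged]
    tags = [t for _, t in tagged]
    found = []
    for pattern in PATTERNS:
        for i in range(len(tags) - len(pattern) + 1):
            if tuple(tags[i:i + len(pattern)]) == pattern:
                found.append(" ".join(words[i:i + len(pattern)]))
    return found

print(opinion_phrases("The hotel is terrible but the breakfast was great"))

A production system would of course use richer grammars and per-language taggers, as the next slide suggests.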
46. • 25+ languages
• Linguistic system (morphology, taggers, grammars, parsers …)
• Hadoop: Scale out CPU
• ~1B opinions in the database
• Python for ML & NLP libraries
Semantic Analysis
69. Luigi
• Dependency definition
• Hadoop / HDFS Integration
• Object oriented abstraction
• Parallelism
• Resume failed jobs
• Visualization of pipelines
• Command line integration
70. Minimal Boilerplate Code
class WordCount(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return InputText(self.date)

    def output(self):
        return luigi.LocalTarget('/tmp/%s' % self.date)

    def run(self):
        count = {}
        for f in self.input():
            for line in f.open('r'):
                for word in line.strip().split():
                    count[word] = count.get(word, 0) + 1
        out = self.output().open('w')
        for word, n in count.items():
            out.write("%s\t%d\n" % (word, n))
        out.close()
71. class WordCount(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return InputText(self.date)

    def output(self):
        return luigi.LocalTarget('/tmp/%s' % self.date)

    def run(self):
        count = {}
        for f in self.input():
            for line in f.open('r'):
                for word in line.strip().split():
                    count[word] = count.get(word, 0) + 1
        out = self.output().open('w')
        for word, n in count.items():
            out.write("%s\t%d\n" % (word, n))
        out.close()
Task Parameters
72. class WordCount(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return InputText(self.date)

    def output(self):
        return luigi.LocalTarget('/tmp/%s' % self.date)

    def run(self):
        count = {}
        for f in self.input():
            for line in f.open('r'):
                for word in line.strip().split():
                    count[word] = count.get(word, 0) + 1
        out = self.output().open('w')
        for word, n in count.items():
            out.write("%s\t%d\n" % (word, n))
        out.close()
Programmatically Defined Dependencies
73. class WordCount(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return InputText(self.date)

    def output(self):
        return luigi.LocalTarget('/tmp/%s' % self.date)

    def run(self):
        count = {}
        for f in self.input():
            for line in f.open('r'):
                for word in line.strip().split():
                    count[word] = count.get(word, 0) + 1
        out = self.output().open('w')
        for word, n in count.items():
            out.write("%s\t%d\n" % (word, n))
        out.close()
Each Task produces an output
74. class WordCount(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return InputText(self.date)

    def output(self):
        return luigi.LocalTarget('/tmp/%s' % self.date)

    def run(self):
        count = {}
        for f in self.input():
            for line in f.open('r'):
                for word in line.strip().split():
                    count[word] = count.get(word, 0) + 1
        out = self.output().open('w')
        for word, n in count.items():
            out.write("%s\t%d\n" % (word, n))
        out.close()
Write Logic in Python
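Not shown on the slides: a task like this is normally triggered from the command line. Assuming the class lives in a hypothetical file wordcount.py that imports luigi and ends with a small main hook, a local run could look roughly like this (the date value is just an example).

if __name__ == '__main__':
    luigi.run()

# From a shell:
# python wordcount.py WordCount --date 2014-07-20 --local-scheduler

Luigi parses --date into the DateParameter, checks whether the output target already exists, and only runs the task (together with its requirements) if it does not.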
94. Snippets from Reviews
“Hips don’t lie”
“Maid was banging”
“Beautiful bowl flowers”
“Irish dance, I love that”
“No ghost sighting”
“One ghost touching”
“Too much cardio, not enough squats in the gym”
“it is like hugging a bony super model”
104. Takeaways
• It is possible to use Python as the primary language for doing large data processing on Hadoop.
• It is not a perfect setup but works well most of the time.
• Keep your ecosystem open to other technologies.