Recommender Systems meet Finance - A literature review (David Zibriczky)
This paper surveys the application of recommender systems in various financial domains. The relevant literature is examined along two directions. First, a domain-based categorization is discussed, focusing on recommendation problems for which the existing literature is substantial. Second, the application of various recommendation algorithms and data mining techniques is summarized. The purpose of this paper is to provide a basis for recommender-system and finance experts to develop further scientific contributions in this field.
Predicting Banking Customer Needs with an Agile Approach to Analytics in the ... (Databricks)
Moneta has repeatedly been recognized as the most innovative bank on the Czech market. This is due in large part to their strategy of completely shifting to the cloud and using data and advanced analytics to innovate the customer experience with use cases ranging from real-time recommendations to fraud detection.
In this talk, we’ll share how we migrated to the cloud to create an agile environment for analytics and AI. From rapid prototyping machine learning use cases to moving models into production, core to this approach was building a unified platform for data and analytics on Apache Spark, Databricks and AWS. Discussion topics include:
Moneta’s strategy and roadmap for moving to the cloud and creation of the data squad
Overview of use cases including ATM/branch location optimization using geo-data, digital channel attribution, identity fraud detection, etc.
Deep dive into the use of digital behavioural data (web, mobile app, internet banking) and offline transactions to understand and predict customer needs in near-real time using Spark MLlib
Approach to building the agile analytics platform and the specific challenges of using the cloud in a financial institution
How Spark is Enabling the New Wave of Converged Cloud Applications (MapR Technologies)
Apache Spark has become the de-facto compute engine of choice for data engineers, developers, and data scientists because of its ability to run multiple analytic workloads with a single, general-purpose compute engine.
But is Spark alone sufficient for developing cloud-based big data applications? What are the other required components for supporting big data cloud processing? How can you accelerate the development of applications which extend across Spark and other frameworks such as Kafka, Hadoop, NoSQL databases, and more?
Case study: How Cozy Cloud monitors every layer of its activity using OVH Met... (OVHcloud)
Find out how Cozy Cloud uses the OVH Metrics Data Platform to monitor and optimise its SaaS service for the general public. From performance data aggregation to customer usage metrics, the Cozy Cloud teams will share their data-centric collaboration experience with you.
Big Data Solutions on Cloud – The Way Forward by Kiththi Perera, SLT (Kiththi Perera)
ITU-TRCSL Symposium on Cloud Computing 2015 Colombo
Session 04: Big Data Strategy in the Cloud and Applications
Speaker's PPT by K. A. Kiththi Perera, Chief Enterprise and Wholesale Officer, Sri Lanka Telecom
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark (Databricks)
Interested in learning how Showtime is leveraging the power of Spark to transform a traditional premium cable network into a data-savvy analytical competitor? The growth in our over-the-top (OTT) streaming subscription business has led to an abundance of user-level data not previously available. To capitalize on this opportunity, we have been building and evolving our unified platform which allows data scientists and business analysts to tap into this rich behavioral data to support our business goals. We will share how our small team of data scientists is creating meaningful features which capture the nuanced relationships between users and content; productionizing machine learning models; and leveraging MLflow to optimize the runtime of our pipelines, track the accuracy of our models, and log the quality of our data over time. From data wrangling and exploration to machine learning and automation, we are augmenting our data supply chain by constantly rolling out new capabilities and analytical products to help the organization better understand our subscribers, our content, and our path forward to a data-driven future.
Authors: Josh McNutt, Keria Bermudez-Hernandez
RPA & AI IN THE CLOUD - THE LATEST IN AS-A-SERVICE SOLUTIONS FOR FINANCE (ITESOFT)
Joint webinar in partnership with Fujitsu, looking at the latest product offering from ITESOFT, Capture-as-a-Service (CaaS).
It considers not just what the product offers, but why it was created, and the pains and challenges it can overcome.
How First to Value Beats First to Market: Case Studies of Fast Data Success (VoltDB)
In this second installment of our Executive Webinar Series on Fast Data Strategy, you will learn how innovative companies in Mobile/Telco, Financial Services, Media & Entertainment, and the Internet of Things (IoT) have successfully tapped into the fast data opportunity. You will learn key metrics and evaluation criteria for three case studies that delivered superior value, profitability and growth. Niall Norton, CEO of Openet, will outline how leveraging a disruptive database technology dramatically improved competitive differentiation and business operations. To view the webinar recording, click here: http://learn.voltdb.com/WRExecSeries2.html
Managing Large Scale Financial Time-Series Data with Graphs (Objectivity)
Slides from a recent webinar by Objectivity showing how the ThingSpan platform is ideal for graph analytics to uncover patterns and insights within large, complex data sets in order to make efficient decisions.
Apache Kafka® Use Cases for Financial Services (Confluent)
Traditional systems were designed in an era that predates large-scale distributed systems. These systems often lack the ability to scale to meet the needs of the modern data-driven organisation. Adding to this is the accumulation of technologies and the explosion of data which can result in complex point-to-point integrations where data becomes siloed or separated across the enterprise.
The demand for fast results and rapid decision making has driven financial institutions to adopt real-time event streaming and processing of data in order to stay on the competitive edge. Apache Kafka and the Confluent Platform are designed to solve the problems associated with traditional systems and to provide a modern, distributed architecture and real-time data streaming capability. In addition, these technologies open up a range of use cases for financial services organisations, many of which will be explored in this talk.
Streaming Analytics Comparison of Open Source Frameworks, Products, Cloud Ser... (Kai Wähner)
Streaming Analytics Comparison of Open Source Frameworks, Products and Cloud Services. Includes Apache Storm, Flink, Spark, TIBCO, IBM, AWS Kinesis, Striim, Zoomdata, ...
This session discusses the technical concepts of stream processing / streaming analytics and how it is related to big data, mobile, cloud and internet of things. Different use cases such as predictive fault management or fraud detection are used to show and compare alternative frameworks and products for stream processing and streaming analytics.
The focus of the session is on comparing:
- different open source frameworks such as Apache Apex, Apache Flink or Apache Spark Streaming
- engines from software vendors such as IBM InfoSphere Streams or TIBCO StreamBase
- cloud offerings such as AWS Kinesis
- real-time streaming UIs such as Striim, Zoomdata or TIBCO Live Datamart
Live demos will give the audience a good feeling about how to use these frameworks and tools.
The session will also discuss how stream processing is related to Apache Hadoop frameworks (such as MapReduce, Hive, Pig or Impala) and machine learning (such as R, Spark ML or H2O.ai).
OVH Analytics Data Compute and Apache Spark as a Service (Mojtaba Imani)
If you have big data processing needs and want a fully provisioned private Apache Spark cluster of your own, OVH Analytics Data Compute is the answer. It will save you money and a lot of time.
Challenges in building a churn prediction model in different industries, presented by Jelena Pekez of Comtrade System Integration. The talk focuses on real-life use-case experience.
Similar to Large scale data processing pipelines at trivago (20)
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) avoids duplicate computations and can likewise reduce iteration time. Road networks often contain chains which can be short-circuited before the PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
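To make the convergence-skipping idea concrete, here is a minimal Python sketch (an assumption for illustration, not the STICD reference implementation; dangling-node handling is omitted): vertices whose rank change falls below a tolerance are marked as converged and are not recomputed in later iterations.

```python
def pagerank_skip_converged(graph, d=0.85, tol=1e-6, max_iter=100):
    """Power-iteration PageRank that skips updates for already-converged vertices.

    graph: dict mapping each vertex to the list of vertices it links to.
    Note: a converged vertex can still drift if its in-neighbours keep changing,
    so this is a heuristic, and dangling-node mass is ignored for brevity.
    """
    vertices = list(graph)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    incoming = {v: [] for v in vertices}          # reverse adjacency for "pull" updates
    for u, outs in graph.items():
        for v in outs:
            incoming[v].append(u)
    converged = set()
    for _ in range(max_iter):
        new_rank = dict(rank)
        for v in vertices:
            if v in converged:
                continue                          # skip work for converged vertices
            total = sum(rank[u] / len(graph[u]) for u in incoming[v] if graph[u])
            new_rank[v] = (1.0 - d) / n + d * total
            if abs(new_rank[v] - rank[v]) < tol:
                converged.add(v)
        rank = new_rank
        if len(converged) == n:
            break
    return rank

# Example: pagerank_skip_converged({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```

The other optimisations described in the abstract (in-identical vertices, chain short-circuiting, per-component computation) would be layered on top of a loop like this.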
2. Clemens Valiente
Lead BI Data Engineer, trivago Düsseldorf
Email: clemens.valiente@trivago.com
Phone: +49 (0) 211 540 65 - 137
de.linkedin.com/in/clemensvaliente
Originally a mathematician; studied at Uni Erlangen; at trivago for 5 years
3. Data-driven PR and External Communication
Price information collected from the various booking websites and shown to our visitors also gives us a thorough overview of trends and the development of hotel prices. This knowledge is then used by our Content Marketing & Communication Department (CMC) to write stories and articles.
4. The past: Data pipeline 2010 – 2015
[Pipeline diagram: Java Software Engineering, Business Intelligence, CMC]
5. The past: Data pipeline 2010 – 2015 (Facts & Figures)
Price dimensions
- Around one million hotels
- 250 booking websites
- Travellers search for up to 180 days in advance
- Data collected over five years
Restrictions
- Only single-night stays
- Only prices from European visitors
- Prices cached up to 30 minutes
- One price per hotel, website and arrival date per day
- "Insert ignore": the first price per key wins (a minimal sketch of this rule appears after this slide)
Size of data
- We collected a total of 56 billion prices in those five years
- Towards the end of this pipeline in 2015, the average was around 100 million prices per day written to BI
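As a minimal illustration of the "insert ignore" restriction above, the following Python sketch (with hypothetical field names) keeps only the first price seen per (hotel, website, arrival date, collection day) key and silently drops the rest:

```python
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str, str, str]  # (hotel_id, website, arrival_date, collection_day)

def first_price_per_key(prices: Iterable[dict]) -> Dict[Key, float]:
    """Mimics MySQL's INSERT IGNORE behaviour: the first price per key wins."""
    kept: Dict[Key, float] = {}
    for p in prices:
        key = (p["hotel_id"], p["website"], p["arrival_date"], p["collection_day"])
        kept.setdefault(key, p["price"])  # later prices for the same key are ignored
    return kept
```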
6. The past: Data pipeline 2010 – 2015
[Pipeline diagram, revisited: Java Software Engineering, Business Intelligence, CMC]
7. Refactoring the pipeline: Requirements
• Scales with an arbitrary amount of data (future proof)
• Reliable and resilient
• Low performance impact on the Java backend
• Long-term storage of raw input data
• Fast processing of filtered and aggregated data
• Open source
• We want to log everything:
• More prices: length of stay, room type, breakfast info, room category, domain
• With more information: net & gross price, city tax, resort fee, affiliate fee, VAT
10. Present data pipeline 2016 – facts & figures
Cluster specifications
- 24 machines (to be 48)
- 700 TB disc space, 60% used
- 1.7 TB memory in YARN
- 24 cores per machine, 864 VCores in total (overprovisioning)
Data size (price log)
- 1.2 trillion messages collected so far
- 6 billion messages/day
- 44 TB of data
Data processing
- Camus: 30 mappers writing data in 10-minute intervals
- First aggregation/filtering stage in Hive runs in 30 minutes, with 5 days of CPU time spent (a rough sketch of such a stage appears after this slide)
- Impala queries across >100 GB of result tables usually done within a few seconds
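As a rough illustration of what such a first aggregation/filtering stage can look like: the slides describe a Hive job over the Camus-ingested price log; the PySpark sketch below (hypothetical paths and column names, not trivago's actual job) only shows the general shape, producing one aggregated row per hotel, booking site and arrival date per collection day, written as partitioned Parquet for fast downstream Impala queries.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("price-log-first-stage").getOrCreate()

# Raw price messages landed by Camus in 10-minute batches (hypothetical path/schema).
raw = spark.read.json("/data/raw/price_log/2016/*/*")

daily = (
    raw
    .filter((F.col("price") > 0) & F.col("hotel_id").isNotNull())   # drop broken records
    .groupBy("collection_day", "hotel_id", "website", "arrival_date")
    .agg(
        F.min("price").alias("min_price"),
        F.avg("price").alias("avg_price"),
        F.count("*").alias("num_quotes"),
    )
)

# Partitioned Parquet result table, ready for Hive/Impala queries.
daily.write.mode("overwrite").partitionBy("collection_day").parquet("/data/agg/price_daily")
```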
11. Present data pipeline 2016 – results after one year in production
• Very reliable, no downtime or service interruptions of the system
• Java team is very happy – less load on their system
• BI team is very happy – more data, more resources to process it
• CMC team is very happy
• Faster results
• Better quality of results due to more data
• More detailed results
• => Shorter research phase, more and better stories
• => Fewer requests to BI
12. Present data pipeline 2016 – use cases & status quo
Uses for price information
- Monitoring price parity in the hotel market
- Anomaly and fraud detection
- Price feed for online marketing
- Display of price development and delivering price alerts to website visitors
Other data sources and usage
- Clicklog information from our website and mobile app
- Used for marketing performance analysis, product tests, invoice generation, etc.
Status quo
- Our entire business runs on and through the Kafka – Hadoop pipeline
- Almost all departments rely on data, insights and metrics delivered by Hadoop
- Most of the company could not do their job without Hadoop data
14. Key challenges and learnings
Mastering Hadoop
- Finding your log files
- Interpreting error messages correctly
- Understanding settings and how to use them to solve problems
- Store data in wide, denormalised Hive tables in Parquet format with nested data types (a small sketch appears after this slide)
Using Hadoop
- Offer easy Hadoop access to users (Impala / Hive JDBC with visualisation tools)
- Educate users on how to write good code, with strict guidelines and reviews
- Load balancer for Impala
- Daily dump of the fsimage as a Hive table
Bad parts
- HUE (the standard GUI)
- Write Oozie workflows and coordinators in XML, not through the Hue interface
- Monitoring Impala
- Still some hard-to-find bugs in Hive & Impala
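A small PySpark sketch (assumed schema and paths, not trivago's actual tables) of the "wide, denormalised table in Parquet with nested data types" idea: instead of normalising individual quotes into a separate table, they are collected into an array of structs on one row per hotel and arrival date.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("wide-denormalised-prices").getOrCreate()

# One row per individual price quote (illustrative data).
quotes = spark.createDataFrame(
    [
        ("h1", "2016-06-01", "siteA", 120.0),
        ("h1", "2016-06-01", "siteB", 115.0),
        ("h2", "2016-06-01", "siteA", 80.0),
    ],
    ["hotel_id", "arrival_date", "website", "price"],
)

# Wide layout: all quotes for a hotel/arrival date kept as a nested array of structs.
wide = quotes.groupBy("hotel_id", "arrival_date").agg(
    F.collect_list(F.struct("website", "price")).alias("quotes"),
    F.min("price").alias("min_price"),
)

# Parquet preserves the nested types, so the table can be queried from Hive
# (and from Impala versions that support nested types).
wide.write.mode("overwrite").parquet("/data/wide/hotel_prices")
```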
21. Clemens Valiente
Lead BI Data Engineer, trivago Düsseldorf
Email: clemens.valiente@trivago.com
Phone: +49 (0) 211 540 65 - 137
de.linkedin.com/in/clemensvaliente
Originally a mathematician; studied at Uni Erlangen; at trivago for 5 years
Thank you! Now it is time for questions and comments!