SlideShare a Scribd company logo
Make the elephant fly,
once again
Luangsay Sourygna
sluangsay@bol.com
• Bol.com
• Data Quality
• Hive_compared_bq
• DataPrep
• PII data
• Should we lift & shift?
Agenda
Bol.com
• Number 1 webshop in Netherlands & Belgium
• Yes, in Spain you’ve also heard about us:
Winner of the Entrepreneurial Award
This year, the winner is Bol.com from The
Netherlands
(Barcelona, 2014)
It all began in 2008…
S0 …
How big is
bol.com?
Really?
> 16 million products
for sale
> 50 million in catalog
Hadoop in production
since 2010
> 6 million active
customers
> 40 million visits per
month
> 5000 million
pageviews/ye
ar
Hadoop at bol.com
• On Premise Production cluster = 35 nodes
• 30+ IT teams
• Several Business Teams
But lot of challenges…
• Lack of flexibility:
• Version HDP
• Christmas’ peak
• Security issues
• Who likes Kerberos?
• No PII
• We’re overloaded:
• Sysops
• YARN
Let’s go to the Cloud
Moving data seems easy
Then: migrating ETL…
Difficult in Big Data…
Same challenge in 2015
• Hortonworks’ project (audit):
• SQL server -> Hive
• Huge ETL process
• To test: 500MB – 60GB databases
• 1st tests:
• #rows
• Manually reading few lines
• Developed 3 validation tools
• Ex: https://
community.hortonworks.com/articles/1283/hive-script-to-validate-tabl
es-compare-one-with-an.html
hive_compared_bq
• Improved validation tool:
• No Sqoop
• Totally scalable (moving only aggregated checksums)
• Now: & (also …)
• https://github.com/bolcom/hive_compared_bq
Improving test coverage
• Not only for ETL migration
• Also for “integration” test
• Testing tools:
• Unit tests: quick, automated
• Hive_compared_bq: huge tests, capturing outliers
Data Quality with DataPrep
Yet another BigData processing tool…
• input:
• “Intelligent, visual & serverless data preparation”
• Business People focused
• “Excell” like
• See data + statistics
• Just push “Run” button…
• Presentation: https://www.youtube.com/watch?v=Q5GuTIgmt98
Why Data Quality?
• For IT: better understanding of data
• See & “feel” data
• Help before developing
• Stats: outliers, skew
• For Business:
• Before: 1st
analysis  better Jiras
• After: validate with stats
Beware of the dog
Security in our Hadoop
• Kerberos issues:
• Integration Java libraries
• REST interfaces not secured
• Not encrypted
• No strong audit
• No strict HDFS, HBase, Hive permissions
PII in BigQuery!
• Always encrypted: disk + network
• Serverless: patching on time
• No Kerberos
• Central access control
• Hide PII columns with views
• Advanced logging
Example of logs
Migrating: which strategy?
Initial idea: many Dataproc
• “a Hadoop cluster in 90 seconds”
• 1 cluster for each team (per job?)
• Capacity issue
• Configuration issue
• Easy migration: no rewrite of MR, Pig, Hive, Flink…
HBase  BigTable
• HBase replication hooked to BigTable
Finally, no lift & shift
• Use of HBase is not optimal
• Better using native Google Tools:
• Cheaper
• Faster
Hive vs BigQuery
• “Now with BigQuery, I start to think that a query is
slow when it takes more than 30 seconds”
Alternatives to Pig
• BigQuery
• Beam
Flexibility of Beam
• Easy for future migration:
• In our Hadoop
• In Google Cloud
• Or maybe Beam in Cloud = +
• Overhead of DataFlow = 2 mn, similar to Dataproc
• Flink = better metrics + debugging
Netherlands…
• Want to discover what it feels like living 4 meters
under the sea level?
• Yes… We’re hiring!
Thanks
till next bol.com
Sourygna Luangsay
sluangsay@bol.com

More Related Content

Similar to Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017

To Have Own Data Analytics Platform, Or NOT To
To Have Own Data Analytics Platform, Or NOT ToTo Have Own Data Analytics Platform, Or NOT To
To Have Own Data Analytics Platform, Or NOT To
SATOSHI TAGOMORI
 
Real Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup
Real Time Interactive Queries IN HADOOP: Big Data Warehousing MeetupReal Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup
Real Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup
Caserta
 
Store, Extract, Transform, Load, Visualize. Untagged Conference
Store, Extract, Transform, Load, Visualize. Untagged ConferenceStore, Extract, Transform, Load, Visualize. Untagged Conference
Store, Extract, Transform, Load, Visualize. Untagged Conference
Ani Lopez
 
2015 WritersUA Sourcing Graphics
2015 WritersUA Sourcing Graphics2015 WritersUA Sourcing Graphics
2015 WritersUA Sourcing Graphics
Mary Connor
 
Retail & CPG
Retail & CPGRetail & CPG
Real Time BI with Hadoop
Real Time BI with HadoopReal Time BI with Hadoop
Real Time BI with Hadoop
Bradford Stephens
 
Games Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - PlumbeeGames Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - Plumbee
GIAF
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil Games
Rob Winters
 
Data-Driven Development Era and Its Technologies
Data-Driven Development Era and Its TechnologiesData-Driven Development Era and Its Technologies
Data-Driven Development Era and Its Technologies
SATOSHI TAGOMORI
 
Visualizing Library Data
Visualizing Library DataVisualizing Library Data
Visualizing Library Data
Bohyun Kim
 
Building a Data Driven Company
Building a Data Driven CompanyBuilding a Data Driven Company
Building a Data Driven Company
Maciej Mróz
 
Hadoop and SAP BI
Hadoop and SAP BI   Hadoop and SAP BI
Hadoop and SAP BI
Praveen Kumar (Tyagi)
 
HP Discover: Real Time Insights from Big Data
HP Discover: Real Time Insights from Big DataHP Discover: Real Time Insights from Big Data
HP Discover: Real Time Insights from Big Data
Rob Winters
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
Caserta
 
Big data at zulily
Big data at zulilyBig data at zulily
Big data at zulily
zulily
 
Big data for cio 2015
Big data for cio 2015Big data for cio 2015
Big data for cio 2015
Zohar Elkayam
 
Rapid Data Exploration With Hadoop
Rapid Data Exploration With HadoopRapid Data Exploration With Hadoop
Rapid Data Exploration With Hadoop
Peter Skomoroch
 
Advanced Analytics Implementations at EA scale
Advanced Analytics Implementations at EA scaleAdvanced Analytics Implementations at EA scale
Advanced Analytics Implementations at EA scale
Ani Lopez
 
Tips in migrating to SharePoint 2016 or O365, to avoid a migration headache
Tips in migrating to SharePoint 2016 or O365, to avoid a migration headacheTips in migrating to SharePoint 2016 or O365, to avoid a migration headache
Tips in migrating to SharePoint 2016 or O365, to avoid a migration headache
Mike Maadarani
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data Warehouse
Caserta
 

Similar to Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017 (20)

To Have Own Data Analytics Platform, Or NOT To
To Have Own Data Analytics Platform, Or NOT ToTo Have Own Data Analytics Platform, Or NOT To
To Have Own Data Analytics Platform, Or NOT To
 
Real Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup
Real Time Interactive Queries IN HADOOP: Big Data Warehousing MeetupReal Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup
Real Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup
 
Store, Extract, Transform, Load, Visualize. Untagged Conference
Store, Extract, Transform, Load, Visualize. Untagged ConferenceStore, Extract, Transform, Load, Visualize. Untagged Conference
Store, Extract, Transform, Load, Visualize. Untagged Conference
 
2015 WritersUA Sourcing Graphics
2015 WritersUA Sourcing Graphics2015 WritersUA Sourcing Graphics
2015 WritersUA Sourcing Graphics
 
Retail & CPG
Retail & CPGRetail & CPG
Retail & CPG
 
Real Time BI with Hadoop
Real Time BI with HadoopReal Time BI with Hadoop
Real Time BI with Hadoop
 
Games Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - PlumbeeGames Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - Plumbee
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil Games
 
Data-Driven Development Era and Its Technologies
Data-Driven Development Era and Its TechnologiesData-Driven Development Era and Its Technologies
Data-Driven Development Era and Its Technologies
 
Visualizing Library Data
Visualizing Library DataVisualizing Library Data
Visualizing Library Data
 
Building a Data Driven Company
Building a Data Driven CompanyBuilding a Data Driven Company
Building a Data Driven Company
 
Hadoop and SAP BI
Hadoop and SAP BI   Hadoop and SAP BI
Hadoop and SAP BI
 
HP Discover: Real Time Insights from Big Data
HP Discover: Real Time Insights from Big DataHP Discover: Real Time Insights from Big Data
HP Discover: Real Time Insights from Big Data
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
 
Big data at zulily
Big data at zulilyBig data at zulily
Big data at zulily
 
Big data for cio 2015
Big data for cio 2015Big data for cio 2015
Big data for cio 2015
 
Rapid Data Exploration With Hadoop
Rapid Data Exploration With HadoopRapid Data Exploration With Hadoop
Rapid Data Exploration With Hadoop
 
Advanced Analytics Implementations at EA scale
Advanced Analytics Implementations at EA scaleAdvanced Analytics Implementations at EA scale
Advanced Analytics Implementations at EA scale
 
Tips in migrating to SharePoint 2016 or O365, to avoid a migration headache
Tips in migrating to SharePoint 2016 or O365, to avoid a migration headacheTips in migrating to SharePoint 2016 or O365, to avoid a migration headache
Tips in migrating to SharePoint 2016 or O365, to avoid a migration headache
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data Warehouse
 

More from Big Data Spain

Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data Spain
 
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Big Data Spain
 
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
Big Data Spain
 
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Big Data Spain
 
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Big Data Spain
 
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Big Data Spain
 
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Big Data Spain
 
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Big Data Spain
 
State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...
Big Data Spain
 
Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...
Big Data Spain
 
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Big Data Spain
 
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
Big Data Spain
 
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Big Data Spain
 
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Big Data Spain
 
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Big Data Spain
 
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Big Data Spain
 
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
Big Data Spain
 
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Big Data Spain
 
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
Big Data Spain
 
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...
Big Data Spain
 

More from Big Data Spain (20)

Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
 
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
 
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
 
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
 
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
 
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
 
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
 
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
 
State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...
 
Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...
 
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
 
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
 
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
 
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
 
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
 
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
 
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
 
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
 
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...
 

Recently uploaded

"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 

Recently uploaded (20)

"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 

Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017

  • 1.
  • 2. Make the elephant fly, once again Luangsay Sourygna sluangsay@bol.com
  • 3. • Bol.com • Data Quality • Hive_compared_bq • DataPrep • PII data • Should we lift & shift? Agenda
  • 4. Bol.com • Number 1 webshop in Netherlands & Belgium • Yes, in Spain you’ve also heard about us: Winner of the Entrepreneurial Award This year, the winner is Bol.com from The Netherlands (Barcelona, 2014)
  • 5. It all began in 2008…
  • 6. S0 … How big is bol.com? Really?
  • 7. > 16 million products for sale > 50 million in catalog Hadoop in production since 2010 > 6 million active customers > 40 million visits per month > 5000 million pageviews/ye ar
  • 8. Hadoop at bol.com • On Premise Production cluster = 35 nodes • 30+ IT teams • Several Business Teams
  • 9. But lot of challenges… • Lack of flexibility: • Version HDP • Christmas’ peak • Security issues • Who likes Kerberos? • No PII • We’re overloaded: • Sysops • YARN
  • 10. Let’s go to the Cloud
  • 13. Difficult in Big Data…
  • 14. Same challenge in 2015 • Hortonworks’ project (audit): • SQL server -> Hive • Huge ETL process • To test: 500MB – 60GB databases • 1st tests: • #rows • Manually reading few lines • Developed 3 validation tools • Ex: https:// community.hortonworks.com/articles/1283/hive-script-to-validate-tabl es-compare-one-with-an.html
  • 15. hive_compared_bq • Improved validation tool: • No Sqoop • Totally scalable (moving only aggregated checksums) • Now: & (also …) • https://github.com/bolcom/hive_compared_bq
  • 16. Improving test coverage • Not only for ETL migration • Also for “integration” test • Testing tools: • Unit tests: quick, automated • Hive_compared_bq: huge tests, capturing outliers
  • 17. Data Quality with DataPrep
  • 18. Yet another BigData processing tool… • input: • “Intelligent, visual & serverless data preparation” • Business People focused • “Excell” like • See data + statistics • Just push “Run” button… • Presentation: https://www.youtube.com/watch?v=Q5GuTIgmt98
  • 19.
  • 20. Why Data Quality? • For IT: better understanding of data • See & “feel” data • Help before developing • Stats: outliers, skew • For Business: • Before: 1st analysis  better Jiras • After: validate with stats
  • 22. Security in our Hadoop • Kerberos issues: • Integration Java libraries • REST interfaces not secured • Not encrypted • No strong audit • No strict HDFS, HBase, Hive permissions
  • 23. PII in BigQuery! • Always encrypted: disk + network • Serverless: patching on time • No Kerberos • Central access control • Hide PII columns with views • Advanced logging
  • 26. Initial idea: many Dataproc • “a Hadoop cluster in 90 seconds” • 1 cluster for each team (per job?) • Capacity issue • Configuration issue • Easy migration: no rewrite of MR, Pig, Hive, Flink…
  • 27. HBase  BigTable • HBase replication hooked to BigTable
  • 28. Finally, no lift & shift • Use of HBase is not optimal • Better using native Google Tools: • Cheaper • Faster
  • 29. Hive vs BigQuery • “Now with BigQuery, I start to think that a query is slow when it takes more than 30 seconds”
  • 30. Alternatives to Pig • BigQuery • Beam
  • 31. Flexibility of Beam • Easy for future migration: • In our Hadoop • In Google Cloud • Or maybe Beam in Cloud = + • Overhead of DataFlow = 2 mn, similar to Dataproc • Flink = better metrics + debugging
  • 32. Netherlands… • Want to discover what it feels like living 4 meters under the sea level? • Yes… We’re hiring!
  • 33. Thanks till next bol.com Sourygna Luangsay sluangsay@bol.com