SlideShare a Scribd company logo
Scaling self service
on Hadoop
Sander Kieft
@skieft
Photo credits: niznoz - https://www.flickr.com/photos/niznoz/3116766661/
About me
@skieft
 Manager Core Services at Sanoma
 Responsible for all common services, including the Big Data platform
 Work:
– Centralized services
– Data platform
– Search
 Like:
– Work
– Water(sports)
– Whiskey
– Tinkering: Arduino, Raspberry PI, soldering stuff
24April20153
Sanoma, Publishing and Learning company
2+100
2 Finnish newspapers
Over 100 magazines
24April2015 Presentationname4
5
TV channels in Finland
and The Netherlands
200+
Websites
100
Mobile applications on
various mobile platforms
Past
History
< 2008 2009 2010 2011 2012 2013 2014 2015
Self service
Photo credits: misternaxal - http://www.flickr.com/photos/misternaxal/2888791930/
Self service
Self service levels
Personal Departmental Corporate
Full self service Support with publishing dashboards
and data loading
Full service and support on
dashboard
Information is created by end users with
little or no oversight. Users are
empowered to integrate different data
sources and make their own
calculations.
Information has been created by end
users and is worth sharing, but has not
been validated.
Information that has gone through a
rigorous validation process can be
disseminated as official data.
24April20159
Information Workers Information Consumers
Full Agility Centralized Development
Excel Static reportsFocus
Self service position – 2010 starting point
Source Extraction
Trans-
formation
Modeling Load
Report /
Dashboar
d
Insight
24April201510
Data team Analysts
History
< 2008 2009 2010 2011 2012 2013 2014 2015
4/24/2015 © SanomaMedia11
Glue: ETL
 EXTRACT
 TRANSFORM
 LOAD
24April201512
24April2015 Presentationname13
24April2015 Presentationname14
#fail
 Hadoop and Qlikview proofed to be really valuable tools
 Traditional ETL tools don’t scale for Big Data sources
 Big Data projects are not BI projects
 Doing full end-to-end integrations and dashboard development doesn’t scale
 Qlikview was not good enough as the front-end to the cluster
 Hadoop requires developers not BI consultants
Learnings
History
< 2008 2009 2010 2011 2012 2013 2014 2015
Russell Jurney
Agile Data
24.4.2015 © SanomaMedia Presentationname/Author21
Self service position – Agile Data
Source Extraction
Trans-
formation
Modeling Load
Report /
Dashboar
d
Insight
24April201522
New glue
Photo credits: Sheng Hunglin - http://www.flickr.com/photos/shenghunglin/304959443/
ETL Tool features
 Processing
 Scheduling
 Data quality
 Data lineage
 Versioning
 Annotating
24April201524
24April2015 Presentationname25
Photo credits: atomicbartbeans - http://www.flickr.com/photos/atomicbartbeans/71575328/
24.4.2015 © SanomaMedia Presentationname/Author26
24.4.2015 © SanomaMedia Presentationname/Author27
Processing - Jython
 No JVM startup overhead for Hadoop API usage
 Relatively concise syntax (Python)
 Mix Python standard library with any Java libs
24April2015 Presentationname28
Scheduling - Jenkins
 Flexible scheduling with dependencies
 Saves output
 E-mails on errors
 Scales to multiple nodes
 REST API
 Status monitor
 Integrates with version control
24April2015 Presentationname29
ETL Tool features
 Processing – Bash & Jython
 Scheduling – Jenkins
 Data quality
 Data lineage
 Versioning – Mecurial (hg)
 Annotating – Commenting the code 
24April2015 Presentationname30
Processes
Photo credits: Paul McGreevy - http://www.flickr.com/photos/48379763@N03/6271909867/
Independent jobs
Source (external)
HDFS upload + move in place
Staging (HDFS)
MapReduce + HDFS move
Hive-staging (HDFS)
Hive map external table + SELECT INTO
Hive
24April2015 Presentationname32
Typical data flow - Extract
Jenkins QlikViewHadoop
HDFS
Hadoop
M/R or Hive
1. Job code from
hg (mercurial)
2. Get the data from
S3, FTP or API
3. Store the raw source
data on HDFS
Typical data flow - Transform
Jenkins QlikViewHadoop
HDFS
Hadoop
M/R or Hive
1. Job code from
hg (mercurial)
2. Execute M/R or
Hive Query
3. Transform the data to the
intermediate structure
Typical data flow - Load
Jenkins QlikViewHadoop
HDFS
Hadoop
M/R or Hive
1. Job code from
hg (mercurial)
2. Execute Hive
Query
3. Load data to QlikView
Out of order jobs
 At any point, you don’t really know what ‘made it’ to Hive
 Will happen anyway, because some days the data delivery is going to be three
hours late
 Or you get half in the morning and the other half later in the day
 It really depends on what you do with the data
 This is where metrics + fixable data store help...
24April2015 Presentationname36
Fixable data store
 Using Hive partitions
 Jobs that move data from staging create partitions
 When new data / insight about the data arrives, drop the partition and re-insert
 Be careful to reset any metrics in this case
 Basically: instead of trying to make everything transactional, repair afterwards
 Use metrics to determine whether data is fit for purpose
24April2015 Presentationname37
Photo credits: DL76 - http://www.flickr.com/photos/dl76/3643359247/
Enabling self service
24.4.2015 © SanomaMedia39
??
Netherlands
Finland
Staging Raw Data
(Native format)
HDFS
Normalized Data
structure
HIVE
Reusable
stem data
• Weather
• Int. event
calanders
• Int. AdWord
Prices
• Exchange
rates
Source
systems
FI
NL
?
Extract Transform
Data
Scientists
Business
Analists
Enabling self service
24.4.2015 © SanomaMedia40
??
Netherlands
Finland
Staging Raw Data
(Native format)
HDFS
Normalized Data
structure
HIVE
Reusable
stem data
• Weather
• Int. event
calanders
• Int. AdWord
Prices
• Exchange
rates
Source
systems
FI
NL
?
Extract Transform
Data
Scientists
Business
Analists
HUE & Hive
QlikView
R Studio Server
Jenkins
Architecture
High Level Architecture
46
sources
Jenkins & Jython
ETL
Back Office
Analytics
Meta Data
AdServing
(incl. mobile, video,cpc,yield,
etc.)
Market
Content
QlikView
Dashboard/
Reporting
Hive
&
HUE
Subscription
Hadoop
Scheduled
loads
AdHoc
Queries
Scheduled
exports
R Studio
Advanced
Analytics
Current State - Real time
 Extending own collecting infrastructure
 Using this to drive recommendations, real time dashboarding, user segementation
and targeting in real time
 Using Kafka and Storm
24April2015 Presentationname47
High Level Architecture
48
Jenkins & Jython
ETL
Meta Data
QlikView
Dashboard/
Reporting
Hive
&
HUE
(Real time)
Collecting
CLOE
Hadoop
Scheduled
loads
AdHoc
Queries
Scheduled
exports
Recommendations /
Learn to Rank
R Studio
Advanced
Analytics
Storm
Stream
processing
Recommendatio
ns & online
learningFlume / Kafka
Back Office
Analytics
AdServing
(incl. mobile, video,cpc,yield,
etc.)
Market
Content
sources
Subscription
DC1 DC2
Sanoma Media The Netherlands Infra
Production VMs
Managed HostingColocation
Big Data platform
Dev., test and
acceptance VMs
Production VMs
• NO SLA (yet)
• Limited support BD9-5
• No Full system backups
• Managed by us with systems department
help
• 99,98% availability
• 24/7 support
• Multi DC
• Full system backups
• High performance EMC SAN storage
• Managed by dedicated systems
department
DC1 DC2
Sanoma Media The Netherlands Infra
Production VMs
Managed HostingColocation
Big Data platform
Dev., test and
acceptance VMs
Production VMs
• NO SLA (yet)
• Limited support BD9-5
• No Full system backups
• Managed by us with systems department
help
• 99,98% availability
• 24/7 support
• Multi DC
• Full system backups
• High performance EMC SAN storage
• Managed by dedicated systems
department
Processing
Storage
Batch
Collecting
Real time
Recommendations
Queue
data for 3
days
Run on
stale data
for 3 days
Present
Self service position – Present
Source Extraction
Trans-
formation
Modeling Load
Report /
Dashboar
d
Insight
24April2015 Presentationname52
Present
< 2008 2009 2010 2011 2012 2013 2014 2015
YARN
Moving to YARN
(CDH 4.3 -> 5.1)
24April2015 "Blastingfrankfurt"byHeptagon-Ownwork.LicensedunderCCBY-SA3.0viaWikimediaCommons-http://commons.wikimedia.org/wiki/File:Blasting_frankfurt.jpg#/media/File:Blasting_frankfurt.jpg54
 Functional testing wasn’t enough
 Takes time to tune the parameters
 Defaults are NOT good enough
 Cluster dead locks
 Grouped nodes with similar Hardware requirements into Node groups
CDH 4 to 5 specific
 Hive CLI -> Beeline
 Hive Locking
 Combining various data sources (web analytics + advertising + user profiles)
 A/B testing + deeper analyses
 Ad spend ROI optimization & attribution
 Ad auction price optimization
 Ad retargeting
 Item and user based recommendations
 Search optimizations
– Online Learn to Rank
– Search suggestions
Current state – Use case
24 April2015 Presentationname55
Current state - Usage
 Main use case for reporting and analytics, increasing data science workloads
 Sanoma standard data platform, used in all Sanoma countries
 > 250 daily dashboard users
 40 daily users: analysts & developers
 43 source systems, with 125 different sources
 400 tables in hive
24April2015 Presentationname56
Current state – Tech and team
 Platform Team:
– 1 product owner
– 3 developers
 Close collaborators:
– ~10 data scientists
– 1 Qlikview application manager
– ½ (system) architect
 Platform:
– 50-60 nodes
– > 600TB storage
– ~3000 jobs/day
 Typical nodes
– 1-2 CPU 4-12 cores
– 2 system disks (RAID 1)
– 4 data disks (2TB, 3TB or 4TB)
– 24-32GB RAM
24April2015 Presentationname57
Challenges
Photo credits: http://kdrunlimited1.blogspot.hu/2013/01/late-edition-squirrel-appreciation-day.html
1. Security
2. Increase SLA’s
3. Integration with other BI environments
4. Improve Self service
5. Data Quality
6. Code reuse
Challenges
24 April2015 Presentationname59
Future
 Better integration batch and real time (unified code base)
 Improve the availability and SLA’s of the platform
 Optimizing Job scheduling (Resource Pools)
 Automated query/job optimization tips for analysts
 Fix Jenkins, is single point of failure
 Moving some NLP (Natural Language Processing) and Image recognition
workload to hadoop
24April2015 Presentationname61
What’s next
Thank you!
Questions?
@skieft
sander.kieft@sanoma.com
Hadoop Summit - Sanoma self service on hadoop

More Related Content

What's hot

Dataflow with Apache NiFi - Crash Course - HS16SJ
Dataflow with Apache NiFi - Crash Course - HS16SJDataflow with Apache NiFi - Crash Course - HS16SJ
Dataflow with Apache NiFi - Crash Course - HS16SJ
DataWorks Summit/Hadoop Summit
 
Audi‘s Hadoop Journey into the Hybrid Cloud
Audi‘s Hadoop Journey into the Hybrid CloudAudi‘s Hadoop Journey into the Hybrid Cloud
Audi‘s Hadoop Journey into the Hybrid Cloud
DataWorks Summit
 
Open source computer vision with TensorFlow, Apache MiniFi, Apache NiFi, Open...
Open source computer vision with TensorFlow, Apache MiniFi, Apache NiFi, Open...Open source computer vision with TensorFlow, Apache MiniFi, Apache NiFi, Open...
Open source computer vision with TensorFlow, Apache MiniFi, Apache NiFi, Open...
DataWorks Summit
 
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionApache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
DataWorks Summit
 
Data in Motion - Data at Rest - Hortonworks a Modern Architecture
Data in Motion - Data at Rest - Hortonworks a Modern ArchitectureData in Motion - Data at Rest - Hortonworks a Modern Architecture
Data in Motion - Data at Rest - Hortonworks a Modern Architecture
Mats Johansson
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big data
Trieu Nguyen
 
The Elephant in the Clouds
The Elephant in the CloudsThe Elephant in the Clouds
The Elephant in the Clouds
DataWorks Summit/Hadoop Summit
 
Pivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ LaunchPivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ Launch
VMware Tanzu
 
Hadoop Summit Tokyo Apache NiFi Crash Course
Hadoop Summit Tokyo Apache NiFi Crash CourseHadoop Summit Tokyo Apache NiFi Crash Course
Hadoop Summit Tokyo Apache NiFi Crash Course
DataWorks Summit/Hadoop Summit
 
Building Audi’s enterprise big data platform
Building Audi’s enterprise big data platformBuilding Audi’s enterprise big data platform
Building Audi’s enterprise big data platform
DataWorks Summit
 
KNIME tutorial
KNIME tutorialKNIME tutorial
KNIME tutorial
George Papadatos
 
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise
Smart Enterprise Big Data Bus for the Modern Responsive EnterpriseSmart Enterprise Big Data Bus for the Modern Responsive Enterprise
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise
DataWorks Summit
 
Big Data Heterogeneous Mixture Learning on Spark
Big Data Heterogeneous Mixture Learning on SparkBig Data Heterogeneous Mixture Learning on Spark
Big Data Heterogeneous Mixture Learning on Spark
DataWorks Summit/Hadoop Summit
 
HDFS tiered storage: mounting object stores in HDFS
HDFS tiered storage: mounting object stores in HDFSHDFS tiered storage: mounting object stores in HDFS
HDFS tiered storage: mounting object stores in HDFS
DataWorks Summit
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle Professional
Michael Rainey
 
Automatic Detection, Classification and Authorization of Sensitive Personal D...
Automatic Detection, Classification and Authorization of Sensitive Personal D...Automatic Detection, Classification and Authorization of Sensitive Personal D...
Automatic Detection, Classification and Authorization of Sensitive Personal D...
DataWorks Summit/Hadoop Summit
 
Tools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudTools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloud
DataWorks Summit
 
Real World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in ProductionReal World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in Production
Codemotion
 
Accelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache HiveAccelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache Hive
DataWorks Summit
 
Airbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stackAirbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stack
Michel Tricot
 

What's hot (20)

Dataflow with Apache NiFi - Crash Course - HS16SJ
Dataflow with Apache NiFi - Crash Course - HS16SJDataflow with Apache NiFi - Crash Course - HS16SJ
Dataflow with Apache NiFi - Crash Course - HS16SJ
 
Audi‘s Hadoop Journey into the Hybrid Cloud
Audi‘s Hadoop Journey into the Hybrid CloudAudi‘s Hadoop Journey into the Hybrid Cloud
Audi‘s Hadoop Journey into the Hybrid Cloud
 
Open source computer vision with TensorFlow, Apache MiniFi, Apache NiFi, Open...
Open source computer vision with TensorFlow, Apache MiniFi, Apache NiFi, Open...Open source computer vision with TensorFlow, Apache MiniFi, Apache NiFi, Open...
Open source computer vision with TensorFlow, Apache MiniFi, Apache NiFi, Open...
 
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionApache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
 
Data in Motion - Data at Rest - Hortonworks a Modern Architecture
Data in Motion - Data at Rest - Hortonworks a Modern ArchitectureData in Motion - Data at Rest - Hortonworks a Modern Architecture
Data in Motion - Data at Rest - Hortonworks a Modern Architecture
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big data
 
The Elephant in the Clouds
The Elephant in the CloudsThe Elephant in the Clouds
The Elephant in the Clouds
 
Pivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ LaunchPivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ Launch
 
Hadoop Summit Tokyo Apache NiFi Crash Course
Hadoop Summit Tokyo Apache NiFi Crash CourseHadoop Summit Tokyo Apache NiFi Crash Course
Hadoop Summit Tokyo Apache NiFi Crash Course
 
Building Audi’s enterprise big data platform
Building Audi’s enterprise big data platformBuilding Audi’s enterprise big data platform
Building Audi’s enterprise big data platform
 
KNIME tutorial
KNIME tutorialKNIME tutorial
KNIME tutorial
 
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise
Smart Enterprise Big Data Bus for the Modern Responsive EnterpriseSmart Enterprise Big Data Bus for the Modern Responsive Enterprise
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise
 
Big Data Heterogeneous Mixture Learning on Spark
Big Data Heterogeneous Mixture Learning on SparkBig Data Heterogeneous Mixture Learning on Spark
Big Data Heterogeneous Mixture Learning on Spark
 
HDFS tiered storage: mounting object stores in HDFS
HDFS tiered storage: mounting object stores in HDFSHDFS tiered storage: mounting object stores in HDFS
HDFS tiered storage: mounting object stores in HDFS
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle Professional
 
Automatic Detection, Classification and Authorization of Sensitive Personal D...
Automatic Detection, Classification and Authorization of Sensitive Personal D...Automatic Detection, Classification and Authorization of Sensitive Personal D...
Automatic Detection, Classification and Authorization of Sensitive Personal D...
 
Tools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudTools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloud
 
Real World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in ProductionReal World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in Production
 
Accelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache HiveAccelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache Hive
 
Airbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stackAirbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stack
 

Similar to Hadoop Summit - Sanoma self service on hadoop

Scaling self service on Hadoop
Scaling self service on HadoopScaling self service on Hadoop
Scaling self service on Hadoop
DataWorks Summit
 
Analyzing the World's Largest Security Data Lake!
Analyzing the World's Largest Security Data Lake!Analyzing the World's Largest Security Data Lake!
Analyzing the World's Largest Security Data Lake!
DataWorks Summit
 
Cloudian 451-hortonworks - webinar
Cloudian 451-hortonworks - webinarCloudian 451-hortonworks - webinar
Cloudian 451-hortonworks - webinar
Hortonworks
 
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...
Impetus Technologies
 
Driving Real Insights Through Data Science
Driving Real Insights Through Data ScienceDriving Real Insights Through Data Science
Driving Real Insights Through Data Science
VMware Tanzu
 
Testing Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopTesting Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of Hadoop
Bill Hayduk
 
451 Research: Data Is the Key to Friction in DevOps
451 Research: Data Is the Key to Friction in DevOps451 Research: Data Is the Key to Friction in DevOps
451 Research: Data Is the Key to Friction in DevOps
Delphix
 
NetApp Industry Keynote - Flash Memory Summit - Aug2015
NetApp Industry Keynote - Flash Memory Summit - Aug2015NetApp Industry Keynote - Flash Memory Summit - Aug2015
NetApp Industry Keynote - Flash Memory Summit - Aug2015
Val Bercovici
 
Security needs in Hadoop’s Current and Future – How Apache Ranger can help?
Security needs in Hadoop’s Current and Future – How Apache Ranger can help?Security needs in Hadoop’s Current and Future – How Apache Ranger can help?
Security needs in Hadoop’s Current and Future – How Apache Ranger can help?
DataWorks Summit
 
Use Cases for NoSQL in Media
Use Cases for NoSQL in MediaUse Cases for NoSQL in Media
Use Cases for NoSQL in Media
Sander Kieft
 
Self-service data discovery for business users and analysts using SAP Lumira
Self-service data discovery for business users and analysts using SAP LumiraSelf-service data discovery for business users and analysts using SAP Lumira
Self-service data discovery for business users and analysts using SAP Lumira
SAP Analytics
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
Provectus
 
Pivotal - Advanced Analytics for Telecommunications
Pivotal - Advanced Analytics for Telecommunications Pivotal - Advanced Analytics for Telecommunications
Pivotal - Advanced Analytics for Telecommunications
Hortonworks
 
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Data Con LA
 
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data PipelinesPutting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
DATAVERSITY
 
Hadoop and Your Enterprise Data Warehouse
Hadoop and Your Enterprise Data WarehouseHadoop and Your Enterprise Data Warehouse
Hadoop and Your Enterprise Data Warehouse
Edgar Alejandro Villegas
 
Unbundling the Modern Streaming Stack With Dunith Dhanushka | Current 2022
Unbundling the Modern Streaming Stack With Dunith Dhanushka | Current 2022Unbundling the Modern Streaming Stack With Dunith Dhanushka | Current 2022
Unbundling the Modern Streaming Stack With Dunith Dhanushka | Current 2022
HostedbyConfluent
 
Eliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside HadoopEliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside Hadoop
Hortonworks
 
Eliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside HadoopEliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside Hadoop
Hortonworks
 
Tableau and hadoop
Tableau and hadoopTableau and hadoop
Tableau and hadoop
Craig Jordan
 

Similar to Hadoop Summit - Sanoma self service on hadoop (20)

Scaling self service on Hadoop
Scaling self service on HadoopScaling self service on Hadoop
Scaling self service on Hadoop
 
Analyzing the World's Largest Security Data Lake!
Analyzing the World's Largest Security Data Lake!Analyzing the World's Largest Security Data Lake!
Analyzing the World's Largest Security Data Lake!
 
Cloudian 451-hortonworks - webinar
Cloudian 451-hortonworks - webinarCloudian 451-hortonworks - webinar
Cloudian 451-hortonworks - webinar
 
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...
 
Driving Real Insights Through Data Science
Driving Real Insights Through Data ScienceDriving Real Insights Through Data Science
Driving Real Insights Through Data Science
 
Testing Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopTesting Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of Hadoop
 
451 Research: Data Is the Key to Friction in DevOps
451 Research: Data Is the Key to Friction in DevOps451 Research: Data Is the Key to Friction in DevOps
451 Research: Data Is the Key to Friction in DevOps
 
NetApp Industry Keynote - Flash Memory Summit - Aug2015
NetApp Industry Keynote - Flash Memory Summit - Aug2015NetApp Industry Keynote - Flash Memory Summit - Aug2015
NetApp Industry Keynote - Flash Memory Summit - Aug2015
 
Security needs in Hadoop’s Current and Future – How Apache Ranger can help?
Security needs in Hadoop’s Current and Future – How Apache Ranger can help?Security needs in Hadoop’s Current and Future – How Apache Ranger can help?
Security needs in Hadoop’s Current and Future – How Apache Ranger can help?
 
Use Cases for NoSQL in Media
Use Cases for NoSQL in MediaUse Cases for NoSQL in Media
Use Cases for NoSQL in Media
 
Self-service data discovery for business users and analysts using SAP Lumira
Self-service data discovery for business users and analysts using SAP LumiraSelf-service data discovery for business users and analysts using SAP Lumira
Self-service data discovery for business users and analysts using SAP Lumira
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
 
Pivotal - Advanced Analytics for Telecommunications
Pivotal - Advanced Analytics for Telecommunications Pivotal - Advanced Analytics for Telecommunications
Pivotal - Advanced Analytics for Telecommunications
 
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
 
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data PipelinesPutting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
 
Hadoop and Your Enterprise Data Warehouse
Hadoop and Your Enterprise Data WarehouseHadoop and Your Enterprise Data Warehouse
Hadoop and Your Enterprise Data Warehouse
 
Unbundling the Modern Streaming Stack With Dunith Dhanushka | Current 2022
Unbundling the Modern Streaming Stack With Dunith Dhanushka | Current 2022Unbundling the Modern Streaming Stack With Dunith Dhanushka | Current 2022
Unbundling the Modern Streaming Stack With Dunith Dhanushka | Current 2022
 
Eliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside HadoopEliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside Hadoop
 
Eliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside HadoopEliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside Hadoop
 
Tableau and hadoop
Tableau and hadoopTableau and hadoop
Tableau and hadoop
 

Recently uploaded

How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 

Recently uploaded (20)

How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 

Hadoop Summit - Sanoma self service on hadoop

  • 1. Scaling self service on Hadoop Sander Kieft @skieft
  • 2. Photo credits: niznoz - https://www.flickr.com/photos/niznoz/3116766661/
  • 3. About me @skieft  Manager Core Services at Sanoma  Responsible for all common services, including the Big Data platform  Work: – Centralized services – Data platform – Search  Like: – Work – Water(sports) – Whiskey – Tinkering: Arduino, Raspberry PI, soldering stuff 24April20153
  • 4. Sanoma, Publishing and Learning company 2+100 2 Finnish newspapers Over 100 magazines 24April2015 Presentationname4 5 TV channels in Finland and The Netherlands 200+ Websites 100 Mobile applications on various mobile platforms
  • 6. History < 2008 2009 2010 2011 2012 2013 2014 2015
  • 7. Self service Photo credits: misternaxal - http://www.flickr.com/photos/misternaxal/2888791930/
  • 9. Self service levels Personal Departmental Corporate Full self service Support with publishing dashboards and data loading Full service and support on dashboard Information is created by end users with little or no oversight. Users are empowered to integrate different data sources and make their own calculations. Information has been created by end users and is worth sharing, but has not been validated. Information that has gone through a rigorous validation process can be disseminated as official data. 24April20159 Information Workers Information Consumers Full Agility Centralized Development Excel Static reportsFocus
  • 10. Self service position – 2010 starting point Source Extraction Trans- formation Modeling Load Report / Dashboar d Insight 24April201510 Data team Analysts
  • 11. History < 2008 2009 2010 2011 2012 2013 2014 2015 4/24/2015 © SanomaMedia11
  • 12. Glue: ETL  EXTRACT  TRANSFORM  LOAD 24April201512
  • 15.
  • 16.
  • 17. #fail
  • 18.  Hadoop and Qlikview proofed to be really valuable tools  Traditional ETL tools don’t scale for Big Data sources  Big Data projects are not BI projects  Doing full end-to-end integrations and dashboard development doesn’t scale  Qlikview was not good enough as the front-end to the cluster  Hadoop requires developers not BI consultants Learnings
  • 19. History < 2008 2009 2010 2011 2012 2013 2014 2015
  • 20.
  • 21. Russell Jurney Agile Data 24.4.2015 © SanomaMedia Presentationname/Author21
  • 22. Self service position – Agile Data Source Extraction Trans- formation Modeling Load Report / Dashboar d Insight 24April201522
  • 23. New glue Photo credits: Sheng Hunglin - http://www.flickr.com/photos/shenghunglin/304959443/
  • 24. ETL Tool features  Processing  Scheduling  Data quality  Data lineage  Versioning  Annotating 24April201524
  • 25. 24April2015 Presentationname25 Photo credits: atomicbartbeans - http://www.flickr.com/photos/atomicbartbeans/71575328/
  • 26. 24.4.2015 © SanomaMedia Presentationname/Author26
  • 27. 24.4.2015 © SanomaMedia Presentationname/Author27
  • 28. Processing - Jython  No JVM startup overhead for Hadoop API usage  Relatively concise syntax (Python)  Mix Python standard library with any Java libs 24April2015 Presentationname28
  • 29. Scheduling - Jenkins  Flexible scheduling with dependencies  Saves output  E-mails on errors  Scales to multiple nodes  REST API  Status monitor  Integrates with version control 24April2015 Presentationname29
  • 30. ETL Tool features  Processing – Bash & Jython  Scheduling – Jenkins  Data quality  Data lineage  Versioning – Mecurial (hg)  Annotating – Commenting the code  24April2015 Presentationname30
  • 31. Processes Photo credits: Paul McGreevy - http://www.flickr.com/photos/48379763@N03/6271909867/
  • 32. Independent jobs Source (external) HDFS upload + move in place Staging (HDFS) MapReduce + HDFS move Hive-staging (HDFS) Hive map external table + SELECT INTO Hive 24April2015 Presentationname32
  • 33. Typical data flow - Extract Jenkins QlikViewHadoop HDFS Hadoop M/R or Hive 1. Job code from hg (mercurial) 2. Get the data from S3, FTP or API 3. Store the raw source data on HDFS
  • 34. Typical data flow - Transform Jenkins QlikViewHadoop HDFS Hadoop M/R or Hive 1. Job code from hg (mercurial) 2. Execute M/R or Hive Query 3. Transform the data to the intermediate structure
  • 35. Typical data flow - Load Jenkins QlikViewHadoop HDFS Hadoop M/R or Hive 1. Job code from hg (mercurial) 2. Execute Hive Query 3. Load data to QlikView
  • 36. Out of order jobs  At any point, you don’t really know what ‘made it’ to Hive  Will happen anyway, because some days the data delivery is going to be three hours late  Or you get half in the morning and the other half later in the day  It really depends on what you do with the data  This is where metrics + fixable data store help... 24April2015 Presentationname36
  • 37. Fixable data store  Using Hive partitions  Jobs that move data from staging create partitions  When new data / insight about the data arrives, drop the partition and re-insert  Be careful to reset any metrics in this case  Basically: instead of trying to make everything transactional, repair afterwards  Use metrics to determine whether data is fit for purpose 24April2015 Presentationname37
  • 38. Photo credits: DL76 - http://www.flickr.com/photos/dl76/3643359247/
  • 39. Enabling self service 24.4.2015 © SanomaMedia39 ?? Netherlands Finland Staging Raw Data (Native format) HDFS Normalized Data structure HIVE Reusable stem data • Weather • Int. event calanders • Int. AdWord Prices • Exchange rates Source systems FI NL ? Extract Transform Data Scientists Business Analists
  • 40. Enabling self service 24.4.2015 © SanomaMedia40 ?? Netherlands Finland Staging Raw Data (Native format) HDFS Normalized Data structure HIVE Reusable stem data • Weather • Int. event calanders • Int. AdWord Prices • Exchange rates Source systems FI NL ? Extract Transform Data Scientists Business Analists
  • 46. High Level Architecture 46 sources Jenkins & Jython ETL Back Office Analytics Meta Data AdServing (incl. mobile, video,cpc,yield, etc.) Market Content QlikView Dashboard/ Reporting Hive & HUE Subscription Hadoop Scheduled loads AdHoc Queries Scheduled exports R Studio Advanced Analytics
  • 47. Current State - Real time  Extending own collecting infrastructure  Using this to drive recommendations, real time dashboarding, user segementation and targeting in real time  Using Kafka and Storm 24April2015 Presentationname47
  • 48. High Level Architecture 48 Jenkins & Jython ETL Meta Data QlikView Dashboard/ Reporting Hive & HUE (Real time) Collecting CLOE Hadoop Scheduled loads AdHoc Queries Scheduled exports Recommendations / Learn to Rank R Studio Advanced Analytics Storm Stream processing Recommendatio ns & online learningFlume / Kafka Back Office Analytics AdServing (incl. mobile, video,cpc,yield, etc.) Market Content sources Subscription
  • 49. DC1 DC2 Sanoma Media The Netherlands Infra Production VMs Managed HostingColocation Big Data platform Dev., test and acceptance VMs Production VMs • NO SLA (yet) • Limited support BD9-5 • No Full system backups • Managed by us with systems department help • 99,98% availability • 24/7 support • Multi DC • Full system backups • High performance EMC SAN storage • Managed by dedicated systems department
  • 50. DC1 DC2 Sanoma Media The Netherlands Infra Production VMs Managed HostingColocation Big Data platform Dev., test and acceptance VMs Production VMs • NO SLA (yet) • Limited support BD9-5 • No Full system backups • Managed by us with systems department help • 99,98% availability • 24/7 support • Multi DC • Full system backups • High performance EMC SAN storage • Managed by dedicated systems department Processing Storage Batch Collecting Real time Recommendations Queue data for 3 days Run on stale data for 3 days
  • 52. Self service position – Present Source Extraction Trans- formation Modeling Load Report / Dashboar d Insight 24April2015 Presentationname52
  • 53. Present < 2008 2009 2010 2011 2012 2013 2014 2015 YARN
  • 54. Moving to YARN (CDH 4.3 -> 5.1) 24April2015 "Blastingfrankfurt"byHeptagon-Ownwork.LicensedunderCCBY-SA3.0viaWikimediaCommons-http://commons.wikimedia.org/wiki/File:Blasting_frankfurt.jpg#/media/File:Blasting_frankfurt.jpg54  Functional testing wasn’t enough  Takes time to tune the parameters  Defaults are NOT good enough  Cluster dead locks  Grouped nodes with similar Hardware requirements into Node groups CDH 4 to 5 specific  Hive CLI -> Beeline  Hive Locking
  • 55.  Combining various data sources (web analytics + advertising + user profiles)  A/B testing + deeper analyses  Ad spend ROI optimization & attribution  Ad auction price optimization  Ad retargeting  Item and user based recommendations  Search optimizations – Online Learn to Rank – Search suggestions Current state – Use case 24 April2015 Presentationname55
  • 56. Current state - Usage  Main use case for reporting and analytics, increasing data science workloads  Sanoma standard data platform, used in all Sanoma countries  > 250 daily dashboard users  40 daily users: analysts & developers  43 source systems, with 125 different sources  400 tables in hive 24April2015 Presentationname56
  • 57. Current state – Tech and team  Platform Team: – 1 product owner – 3 developers  Close collaborators: – ~10 data scientists – 1 Qlikview application manager – ½ (system) architect  Platform: – 50-60 nodes – > 600TB storage – ~3000 jobs/day  Typical nodes – 1-2 CPU 4-12 cores – 2 system disks (RAID 1) – 4 data disks (2TB, 3TB or 4TB) – 24-32GB RAM 24April2015 Presentationname57
  • 59. 1. Security 2. Increase SLA’s 3. Integration with other BI environments 4. Improve Self service 5. Data Quality 6. Code reuse Challenges 24 April2015 Presentationname59
  • 61.  Better integration batch and real time (unified code base)  Improve the availability and SLA’s of the platform  Optimizing Job scheduling (Resource Pools)  Automated query/job optimization tips for analysts  Fix Jenkins, is single point of failure  Moving some NLP (Natural Language Processing) and Image recognition workload to hadoop 24April2015 Presentationname61 What’s next

Editor's Notes

  1. Poultery
  2. Mainly webanalytics Forester 1 dev + 1 bi
  3. Business Intelligence Self Service Sweet Spot
  4. ETL Pentaho SAS DI Studio Informatica Oozie Other
  5. - cluster deadlocks can appear in moments of high utilization: low reducer slowstart threshold can result in a situation of many jobs with all reducers already allocated, taking up the slots for the remaining mappers which still need to finish ; apparently the Fair scheduler is not smart enough to detect this => all jobs are hanging (deadlock)           solution: increase reducer slowstart threshold (mapreduce.job.reduce.slowstart.completedmaps) to 0.99