2. Elsevier
• Founded in 1880 (136 years old)
• Started life as a traditional publisher
• Specialised in scientific and medical publishing
• Now an information solutions provider
3. Elsevier’s Mission
LEAD THE WAY IN ADVANCING SCIENCE, TECHNOLOGY AND HEALTH
4. A Wealth of Data
• Article abstracts
• Article citation data
• Full text articles
• Profile data
• Institutional data
• Funding data
• Usage data
• Editorial data
• Social networks
12. An Initial Spark
• Use Case: application of NLP to Elsevier’s entire body of published content
• Databricks selected for:
– Mounting of data
– Minimal operational overhead
– Presentation of results
13. An Initial Spark
• Databricks used by a team of 15 people for content analytics
• Spark adopted as the main processing engine for Elsevier’s big data platform
14. An Initial Spark
• Spark 2.0 and Spark 1.6
• RDD to DataFrame, and now to Dataset
• Production workflows in Scala
• Data Science and Analytics workflows may be in Python or R
15. A Recipe for Insights
• Empower the master chefs
• Prepare the ingredients
• Share pre-prepared and cooked dishes
• Serve delicious dishes
17. CSV Potato
• Old Classic
• All shapes and sizes
• Often dirty
• Sometimes it’s a sweet potato
18. XML Onion
• Lots of nested layers
• Sometimes it makes you cry
• Pretty much everyone has some somewhere
19. Delivering the Groceries
• Large suppliers have well-established delivery mechanisms
• The local store may be less reliable
• The local farm even less so
20. Sparking Data Preparation
• 200 million article abstracts
• Stored in Amazon S3
• XML files named by ID
• Data delivery notifications via SNS
• Variable file sizes (kB to MB)
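With file sizes ranging from kilobytes to megabytes, launching one task per tiny file wastes scheduler overhead; a common remedy is to group small files into roughly equal-sized batches before processing. A minimal sketch in plain Python (the greedy heuristic and names are illustrative, not Elsevier’s actual pipeline):

```python
def batch_by_size(files, target_bytes=128 * 1024 * 1024):
    """Greedily group (path, size) pairs into batches of roughly target_bytes.

    A processing engine such as Spark can then handle one batch per task
    instead of one tiny file per task.
    """
    batches, current, current_size = [], [], 0
    # Largest files first, so each batch fills up with few leftovers.
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each batch can then be read as a single unit, keeping task counts proportional to data volume rather than file count.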
21. Sparking Data Preparation
• Skewed key distribution made processing difficult
• Data pre-processing using Spark Streaming
• Data processing in batch using Spark, with Parquet as the output format
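A standard fix for a skewed key distribution is “salting”: split each hot key into several sub-keys, aggregate the sub-keys in parallel, then fold the partial results back together. The slides don’t show Elsevier’s actual fix, so this is a hedged plain-Python sketch of the two-stage pattern (names like `salt_key` are illustrative):

```python
import hashlib

NUM_SALTS = 8  # illustrative fan-out for hot keys


def salt_key(key, value):
    """Derive a deterministic salt so one hot key becomes NUM_SALTS sub-keys."""
    salt = int(hashlib.md5(repr(value).encode()).hexdigest(), 16) % NUM_SALTS
    return (key, salt)


def skew_tolerant_count(records):
    """Two-stage count over (key, value) pairs, tolerant of hot keys."""
    # Stage 1: partial counts per (key, salt) -- in Spark, each salted
    # sub-key lands on a different partition, so no single task is swamped.
    partial = {}
    for key, value in records:
        sk = salt_key(key, value)
        partial[sk] = partial.get(sk, 0) + 1
    # Stage 2: strip the salt and merge -- each key now contributes
    # at most NUM_SALTS rows, a trivially small second aggregation.
    totals = {}
    for (key, _salt), count in partial.items():
        totals[key] = totals.get(key, 0) + count
    return totals
```

In Spark the same idea applies to `reduceByKey` or `groupBy`: aggregate on the salted key first, then re-aggregate on the original key.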
23. Cooking our Ingredients
• Focus on sharing cooked ingredients
• Cooking in batches has its advantages
– Verifying consistency
– Testing and releasing batches
– Recoverability
24. Cooking with Spark
• Our data processing is exclusively done in Spark
• Cooking is mainly done in batch, but we’re looking at more and more streaming
• Most synthesized batch datasets are stored as Parquet
25. Cooking Citations
• 200 million article abstracts in XML format
• Calculation of article citations by article and author
• Transformed to key-value JSON for a REST API
• Mounted and shared in Databricks
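The transformation in the last two bullets can be sketched in plain Python (the field names and JSON shape are assumptions; the real pipeline runs in Spark over the XML abstracts):

```python
import json
from collections import Counter


def citation_counts_to_json(citations):
    """Given (citing_id, cited_id) pairs, emit key-value JSON of citation counts.

    The key is the cited article's ID and the value its total citation
    count -- a shape a REST API can serve with a single key lookup.
    """
    counts = Counter(cited for _citing, cited in citations)
    return json.dumps(dict(counts), sort_keys=True)
```

For example, three citation pairs where article `b` is cited twice and `d` once would serialize to `{"b": 2, "d": 1}`.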
26. Self Service with Databricks
• Data from the data lake mounted in DBFS
• Single shared cluster provided for general use
• Bespoke clusters for heavy workloads
[Chart: Databricks Users at Elsevier, y-axis 0–80]
28. Databricks Use Cases
• Author relationships and graphs
• Author disambiguation research
• Article recommendations
• Data profiling and exploration
• Learning
31. Our Road Ahead
• Fine-grained data access, privacy and security
• Data discovery and provenance
• Data cleansing and classification
• Enhanced operational support
34. HPCC
• Developed for processing public records
• Limited flexibility
• Limited support and community
• High barrier to entry
• Difficult to get back to the original data
• Not fault tolerant