Using Scalding for Data Driven Product Development at LinkedIn

This presentation discusses using Scalding, a Scala-based DSL for writing Hadoop MapReduce jobs, for data-driven product development. It describes how LinkedIn's online systems feed an offline Hadoop environment, and how Scalding addresses problems of scale and complexity through distributed processing, functional programming, type safety, and modularization. It highlights benefits such as succinct code and a high level of abstraction, and reports that the team has written thousands of lines of production Scalding code, with dozens of Scalding flows in production alongside thousands of Hadoop jobs overall.

Using Scalding for Data-Driven
Product Development
Sasha Ovsankin
LinkedIn
Presented to Scala By The Bay
Aug 9, 2014

/summary
Data-Driven Product Development
Scalding = Hadoop + Scala

/data-driven
Diagram labels: Your Amazing Service; Value; Data

/data-driven/linkedin
Diagram labels: “Online” World; Web Applications; NoSQL Data Stores; “Offline” World (Hadoop); HDFS; Hadoop Jobs; Tracking/logging; Analytics; Data Products; Messaging; Message delivery; Databases

/linkedin/big-data/links
• “LinkedIn Big Data Ecosystem”
– http://lnkd.in/big-data-ecosystem
• Grid Operations
– http://lnkd.in/gridops2013

/scalding
http://github.com/twitter/scalding
• Scala-based DSL for Map/Reduce jobs
• Built on Cascading, stable and mature Hadoop framework
• Uses API similar to Scala collections:

    class WordCountJob(args : Args) extends Job(args) {
      TextLine( args("input") )
        .flatMap('line -> 'word) { line : String => line.split("""\s+""") }
        .groupBy('word) { _.size }
        .write( Tsv( args("output") ) )
    }

• Succinct and powerful
• High level of abstraction
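To make the "similar to Scala collections" point concrete, the same word count can be written against ordinary in-memory collections. This is only a sketch for comparison, not part of the talk: `LocalWordCount` and `wordCount` are hypothetical names, and nothing here touches Hadoop or Scalding.

```scala
// Plain Scala collections analogue of the word count job (no Hadoop, no Scalding).
object LocalWordCount {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("""\s+"""))   // one element per whitespace-separated token
      .filter(_.nonEmpty)            // a line with leading whitespace yields an empty token
      .groupBy(identity)             // word -> all of its occurrences
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("to be or not to be", "be quick"))
    // Print tab-separated pairs, sorted by word for stable output.
    counts.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

The shape (flatMap, then groupBy, then a size per group) mirrors the Scalding job; the difference is that the Scalding version describes a distributed job rather than computing a value in memory.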

/data-driven/problem/scaling
• Problem: Scaling
• Solution
– Distributed processing
– High-level description of algorithms
– Functional programming
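Why functional programming helps with distribution can be sketched in plain Scala. Because the combining step is associative, partial results can be computed independently on each block of data and merged afterwards. In this illustration the "partitions" are just in-memory chunks standing in for separate machines; `PartitionedReduce` and `sumOfSquares` are hypothetical names, not anything from the talk.

```scala
object PartitionedReduce {
  // Each chunk stands in for a block of data processed on a separate machine.
  def sumOfSquares(data: Seq[Long], partitions: Int): Long = {
    val chunkSize = math.max(1, math.ceil(data.size.toDouble / partitions).toInt)
    data
      .grouped(chunkSize)                       // split the input into chunks
      .map(chunk => chunk.map(x => x * x).sum)  // "map" side: one partial result per chunk
      .foldLeft(0L)(_ + _)                      // "reduce" side: combine the partial results
  }
}
```

The result does not depend on how many chunks the input is split into, which is the property a framework relies on when it decides the partitioning for you.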

../problem/complexity
• Problem: Complexity
• Solution
– Consistent way of organizing data
• Self-describing data formats (Avro)
• File organization
– Type safety
– Modularization
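A rough illustration of the type safety and modularization points, in plain Scala: records are given an explicit type, and the pipeline is assembled from small functions that can be tested separately. The `PageView` record and its fields are invented for this sketch and are not LinkedIn's schema; in practice the record type would presumably be derived from an Avro schema.

```scala
object TypedPipeline {
  // Hypothetical record type; the field names are illustrative only.
  final case class PageView(memberId: Long, page: String, durationMs: Long)

  // Small, separately testable steps.
  def longViews(minMs: Long)(views: Seq[PageView]): Seq[PageView] =
    views.filter(_.durationMs >= minMs)

  def viewsPerPage(views: Seq[PageView]): Map[String, Int] =
    views.groupBy(_.page).map { case (page, vs) => page -> vs.size }

  // Steps compose into a pipeline; a misspelled field or a mismatched
  // type between steps is rejected at compile time rather than at run time.
  def report(views: Seq[PageView]): Map[String, Int] =
    viewsPerPage(longViews(1000L)(views))
}
```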

/linkedin/hadoop/practices
• All online data end up in HDFS
– Avro encoding is standard
• Production Process
– CI/Automatic Build
• More info forthcoming
– Production Review
– Operations and Monitoring
• More info at http://lnkd.in/gridops2013
• Result: Thousands of jobs running in production
• More info at http://lnkd.in/big-data-ecosystem

../solution/scala/killer-argument
• Map & reduce -- primitives

    scala> (1 to 1000) map { pow(_,2) } reduce { _ + _ }
    res20: Int = 333833500
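The REPL result can be checked with a small self-contained sketch. One caveat: `scala.math.pow` returns a `Double`, so the `Int` result on the slide suggests an integer `pow` was in scope; that is an assumption, and the version below avoids the question by squaring with `*`. The names `SumOfSquares`, `viaMapReduce` and `closedForm` are hypothetical.

```scala
object SumOfSquares {
  // Integer-only version: squares are computed with *, so the result stays an Int.
  val viaMapReduce: Int = (1 to 1000).map(x => x * x).reduce(_ + _)

  // Closed form n(n+1)(2n+1)/6, computed in Long to avoid intermediate overflow.
  def closedForm(n: Long): Long = n * (n + 1) * (2 * n + 1) / 6
}
```

Both routes give 333833500 for n = 1000, matching the value shown on the slide.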

/linkedin/scalding/status
• Started >1 year ago
• Thousands of production LOC written in Scalding by our team
– Pretty happy with readability, maintainability and tooling support
• Dozens of flows are currently in production, and counting
• Created Scalding user group
• Growing interest
• Learning:
– Scala[Scalding] < Scala[ _ ]

/linkedin/join-us
• Work on unique and interesting problems
• Be part of great engineering community
• Use latest tools and technologies
• Help connect the world’s professionals to help them become more productive and successful
• We are looking for amazing people interested in Software Engineering and Data Science
– http://linkedin.com/careers
Questions?
