This document summarizes how Scala and Hadoop are used at eBay. It discusses:
- Why Scala is used, including its functional capabilities and JVM compatibility.
- Why Hadoop is used to process eBay's petabytes of data across its large cluster.
- How Scalding, a Scala library, allows complex Hadoop jobs to be written concisely and tested effectively, improving on other frameworks like Pig and Cascading.
Code examples show how tasks like collaborative filtering, search query analysis, and Markov chains can be implemented in a readable way using Scalding.
12. BIG NUMBERS
• Petabytes of data
• 1k+ node Hadoop cluster
• Multi-billion dollar merchandising business
• Lots of users and items
13. How should I use Map Reduce?
• Raw map reduce
• Pig
• Hive
• Cascading
• Scoobi
• Scalding
14. Decision Time
• “And every one that heareth these sayings of mine (great software engineers of the past), and doeth them not, shall be likened unto a foolish man, which built his house upon the sand.”
• “And the rain descended, and the floods came, and the winds blew, and beat upon that house; and it fell: and great was the fall of it.”
16. Good Pig
A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x > 5;
DUMP B;
C = FOREACH B GENERATE y, z;
STORE C INTO 'output';
-- do joins and group by also
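For comparison, the filter-and-project steps of the Pig script above can be modelled on plain Scala collections. This is only a local sketch, not Scalding or Hadoop code: the `Row` type, the `GoodPigSketch` object, and the sample values are hypothetical and exist purely for illustration.

```scala
// Local model of the Pig script above, using in-memory Scala collections.
// Row, GoodPigSketch and the sample data are hypothetical, not from the talk.
object GoodPigSketch {
  final case class Row(x: Int, y: String, z: String)

  // B = FILTER A BY x > 5;  C = FOREACH B GENERATE y, z;
  def filterAndProject(rows: Seq[Row]): Seq[(String, String)] =
    rows.filter(_.x > 5).map(r => (r.y, r.z))

  def main(args: Array[String]): Unit = {
    val a = Seq(Row(3, "a", "b"), Row(7, "c", "d"), Row(9, "e", "f"))
    filterAndProject(a).foreach(println)
  }
}
```

The point of the comparison is that each Pig statement maps onto one collection operation, which is the style of code the talk argues for.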
17. Bad Pig
DEFINE NV_terms `perl nv_terms2.pl` ship('$scripts/nv_terms2.pl');
i5 = stream i4 through NV_terms as (leafcat:chararray, name:chararray, name1:chararray);
i7 = foreach i5 generate leafcat,
    com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name) as name,
    com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name1) as name1;
19. Cascading Rocks!
• What is it?
• Supports large workflows and reusable components
– DAG generation
– Parallel Executions
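To make "DAG generation" and "parallel executions" concrete, here is a minimal sketch of how a dependency graph can be split into waves of tasks that could run side by side. It is not Cascading's planner; the `DagSketch` object, the `waves` function, and the task names in the usage note are assumptions made for illustration.

```scala
// Hypothetical sketch: group tasks of a dependency DAG into parallelisable waves.
object DagSketch {
  // deps maps each task to the set of tasks it depends on.
  // Every task in a wave depends only on tasks from earlier waves,
  // so the tasks within one wave are independent of each other.
  def waves(deps: Map[String, Set[String]]): List[Set[String]] = {
    @annotation.tailrec
    def loop(remaining: Map[String, Set[String]],
             done: Set[String],
             acc: List[Set[String]]): List[Set[String]] =
      if (remaining.isEmpty) acc.reverse
      else {
        // A task is ready once all of its dependencies have completed.
        val ready = remaining.collect {
          case (task, needs) if needs.subsetOf(done) => task
        }.toSet
        require(ready.nonEmpty, "cycle or missing dependency detected")
        loop(remaining -- ready, done ++ ready, ready :: acc)
      }
    loop(deps, Set.empty, Nil)
  }
}
```

For example, with `load` having no dependencies, `clean` and `stats` both depending on `load`, and `rank` depending on `clean`, the sketch places `clean` and `stats` in the same wave.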
20. Cascading code in Scala
// Assumption: masterPipe is a `var` of type Pipe declared earlier in the job
masterPipe = new FilterURLEncodedStrings(masterPipe, "sqr")
masterPipe = new FilterInappropriateQueries(masterPipe, "sqr")
masterPipe = new GroupBy(masterPipe, CFields("user_id", "epoch_ts", "sqr"), sortFields)
30. Markov Chains
• Investigation of buying patterns in ~50 lines of code
val purchases = "firsttime" :: x.take(500).toList
val pairs = purchases zip purchases.tail
val grouped = pairs.groupBy(x => x._1.toString + "-" + x._2.toString)
val sizes = grouped.map { case (key, group) => key -> group.size }.toList
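The slide's excerpt depends on an `x` defined elsewhere in the job. A self-contained sketch of the same idea, counting transitions between consecutive purchases and then normalising them into probabilities, might look like the following. The `MarkovSketch` object, its function names, and the category names in the usage note are hypothetical; the normalisation step is an addition that the slide does not show.

```scala
// Hypothetical, in-memory version of the transition counting on the slide.
object MarkovSketch {
  // Count transitions between consecutive purchases,
  // starting from a synthetic "firsttime" state as on the slide.
  def transitionCounts(history: Seq[String]): Map[(String, String), Int] = {
    val purchases = "firsttime" +: history
    purchases.zip(purchases.tail)
      .groupBy(identity)
      .map { case (pair, occurrences) => pair -> occurrences.size }
  }

  // Normalise the counts per source state to get transition probabilities.
  def transitionProbabilities(
      counts: Map[(String, String), Int]): Map[(String, String), Double] = {
    val totals = counts
      .groupBy { case ((from, _), _) => from }
      .map { case (from, m) => from -> m.values.sum }
    counts.map { case ((from, to), n) => (from, to) -> n.toDouble / totals(from) }
  }
}
```

Given the history `shoes, socks, shoes, socks, hat`, the transition `shoes -> socks` is counted twice, and `socks` is followed by `shoes` and by `hat` with probability 0.5 each.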
31. Mining Search Queries
• 20+ billion user queries - give me the top ones per user
[Diagram stages: De-Dupe, Rank, Validate, Sample Data]
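The ranking step of "top queries per user" can be sketched on a small in-memory log. This is an illustration on plain Scala collections, not the Scalding job from the talk; `TopQueriesSketch`, `topQueriesPerUser`, and the alphabetical tie-breaking rule are assumptions.

```scala
// Hypothetical sketch of ranking each user's most frequent queries.
object TopQueriesSketch {
  // Input: (userId, query) log records. Output: up to n most frequent queries per user.
  def topQueriesPerUser(log: Seq[(String, String)], n: Int): Map[String, Seq[String]] =
    log.groupBy { case (user, _) => user }.map { case (user, records) =>
      val counts = records
        .groupBy { case (_, query) => query }
        .map { case (query, hits) => query -> hits.size }
      // Sort by descending count; ties are broken alphabetically for determinism.
      val ranked = counts.toSeq.sortBy { case (query, count) => (-count, query) }
      user -> ranked.take(n).map(_._1)
    }
}
```

At cluster scale the same group, count, sort, and take sequence would run as a distributed job rather than in memory.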
Mention Option and Either.
First-class functions.
Mention how great traits are.
I feel like Haskell will never break into the corporation.
This is a great draft.
All my life I’ve wanted a type-safe build system. And NOW I have it.
They break backward compatibility.
Weak IDE support: debugging, refactoring, etc.
Explain the madness.
Tell them about the example
The most complicated system for counting words (insert meme here).
Explain why we use Hadoop: the data is huge. I can’t say when you want to make the jump to MapReduce, but I see growth in making it THE platform.
Say why raw MapReduce stinks. Mention what Hive and Scoobi are.
Explain why we didn’t go with Scoobi even though it’s all Scala.
Scheduling and DAG creation.
Where is my SOURCE?
Mention Azkaban.
Can run tasks that don’t depend on each other in parallel.
Supports static dependencies via cascades.
Verbose. You still need to write a bunch of code.
Mention Scoobi and how it’s not super stable.
Remind them how it combines the best of Pig and Cascading.
This is actual code to compute a user’s preferences. Explain a bit about user preferences
Mahout has some functions for this, but they are hard to set up and get going.
Less precise than other state-of-the-art methods but still accurate.
Scala Days talk with Chris Severs.
Linear model.
Talk about concept extraction.
Use SQLite for ad hoc queries.
Talk about the use of cascades.
Talk about traps and counters.
Scalding makes this 100 times easier because of cascades and flows.