Tez

Technology

What is it?
• Complex directed-acyclic-graph (DAG) tasks for processing data
• Built atop Apache Hadoop YARN
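
To make the DAG idea concrete, here is a minimal sketch in plain Python. It is not the Tez API: `run_dag`, the vertex names, and the sample input are all hypothetical, chosen only to show how a job can be expressed as vertices (processing steps) connected by edges (data flow) and run in dependency order, rather than as a fixed map-then-reduce pair.

```python
from collections import Counter
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(vertices, edges, source):
    """Run every vertex after all of its upstream vertices have finished.

    vertices: dict mapping a vertex name to a function that takes a list
              of upstream outputs and returns this vertex's output
    edges:    list of (upstream_name, downstream_name) pairs
    source:   input handed to vertices that have no upstream vertex
    """
    upstream = {name: [] for name in vertices}
    for src, dst in edges:
        upstream[dst].append(src)

    results = {}
    # static_order() yields each vertex only after its predecessors and
    # raises graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(upstream).static_order():
        inputs = [results[u] for u in upstream[name]] or [source]
        results[name] = vertices[name](inputs)
    return results


# Hypothetical job: one tokenize step fans out to two independent steps,
# whose outputs are joined again in a final report step.
vertices = {
    "tokenize": lambda ins: [w for line in ins[0] for w in line.split()],
    "count": lambda ins: Counter(ins[0]),
    "longest": lambda ins: max(ins[0], key=len),
    "report": lambda ins: (ins[0].most_common(1)[0], ins[1]),
}
edges = [
    ("tokenize", "count"),
    ("tokenize", "longest"),
    ("count", "report"),
    ("longest", "report"),
]

results = run_dag(vertices, edges, ["a bb a", "ccc a"])
```

Here `results["report"]` holds the most frequent word with its count, plus the longest word. The fan-out and join in this graph is the kind of shape that a single map and reduce stage cannot express directly.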

Rafael Souza
@rafael_psouza
rafaelpsouza
