This document provides an overview of Apache Spark, including its core concepts, transformations and actions, persistence, parallelism, and examples. Spark is introduced as a fast and general engine for large-scale data processing, with advantages like in-memory computing, fault tolerance, and rich APIs. Key concepts covered include its resilient distributed datasets (RDDs) and lazy evaluation approach. The document also discusses Spark SQL, streaming, and integration with other tools.
Migrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow – Databricks
This talk will take two existing Spark ML pipelines (Frank The Unicorn, for predicting PR comments (Scala) – https://github.com/franktheunicorn/predict-pr-comments – and Spark ML on Spark Errors (Python)) and explore the steps involved in migrating them to a combination of Spark and Tensorflow. Using the open source Kubeflow project (now with Spark support as of 0.5), we will create two integrated end-to-end pipelines to explore the challenges involved and look at areas of improvement (e.g. Apache Arrow, etc.).
Everything you wanted to know about Apache Tez:
-- Distributed execution framework targeted towards data-processing applications.
-- Based on expressing a computation as a dataflow graph.
-- Highly customizable to meet a broad spectrum of use cases.
-- Built on top of YARN – the resource management framework for Hadoop.
-- Open source Apache incubator project and Apache licensed.
Ray (https://github.com/ray-project/ray) is a framework developed at UC Berkeley and maintained by Anyscale for building distributed AI applications. Over the last year, the broader machine learning ecosystem has been rapidly adopting Ray as the primary framework for distributed execution. In this talk, we will give an overview of how libraries such as Horovod (https://horovod.ai/), XGBoost, and Hugging Face Transformers have integrated with Ray. We will then showcase how Uber leverages Ray and these ecosystem integrations to simplify critical production workloads. This is a joint talk between Anyscale and Uber.
GraphFrames: DataFrame-based graphs for Apache® Spark™ – Databricks
These slides support the GraphFrames: DataFrame-based graphs for Apache Spark webinar. In this webinar, the developers of the GraphFrames package will give an overview, a live demo, and a discussion of design decisions and future plans. This talk will be generally accessible, covering major improvements from GraphX and providing resources for getting started. A running example of analyzing flight delays will be used to explain the range of GraphFrame functionality: simple SQL and graph queries, motif finding, and powerful graph algorithms.
Hadoop Operations - Best practices from the field – Uwe Printz
Talk about Hadoop operations and best practices for building and maintaining a Hadoop cluster.
The talk was held at the data2day conference in Karlsruhe, Germany on 27.11.2014.
Hadoop meets Agile! - An Agile Big Data Model – Uwe Printz
Big Data projects are a struggle, not only on the technical side but also on the organizational side. In this talk the author shares his experience and opinions from almost 5 years of Big Data projects and develops an Agile Big Data Model which reflects his ideas on how Big Data projects can be successful, even in large companies.
Talk held at the crossover meetup of the "Agile Stammtisch Rhein-Main" and the "Hadoop & Spark User Group Rhein-Main" at codecentric AG on 31.01.2017.
A comprehensive overview of the security concepts in the open source Hadoop stack in mid 2015 with a look back into the "old days" and an outlook into future developments.
Deep Learning with Apache Spark: an Introduction – Emanuele Bezzi
Presented at Scala Italy 2016 with Andrea Bessi
Neural networks and deep learning have seen a spectacular advance during the last few years and represent now the state of the art in tasks such as image recognition, automated translations and natural language processing.
Unfortunately, most high-performance deep learning implementations are single-node only and therefore not particularly scalable.
During this talk, we will demonstrate how Apache Spark, the fast and general engine for large-scale data processing, can be used to train artificial neural networks, achieving high performance and parallel computing at the same time.
Beyond a Big Data Pilot: Building a Production Data Infrastructure – StampedeCon
At StampedeCon 2014, Stephen O’Sullivan (Silicon Valley Data Science) presented "Beyond a Big Data Pilot: Building a Production Data Infrastructure."
Creating a data architecture involves many moving parts. By examining the data value chain, from ingestion through to analytics, we will explain how the various parts of the Hadoop and big data ecosystem fit together to support batch, interactive and realtime analytical workloads.
By tracing the flow of data from source to output, we’ll explore the options and considerations for components, including data acquisition, ingestion, storage, data services, analytics and data management. Most importantly, we’ll leave you with a framework for understanding these options and making choices.
Hadoop Summit San Jose 2014: Costing Your Big Data Operations – Sumeet Singh
As organizations begin to make use of large data sets, approaches to understanding and managing the true costs of big data become an important facet of operating at increasing scale.
Whether an on-premise or cloud-based platform is used for storing, processing and analyzing data, our approach explains how to calculate the total cost of ownership (TCO), develop a deeper understanding of compute and storage resources, and run the big data operations with its own P&L, full transparency in costs, and with metering and billing provisions. While our approach is generic, we will illustrate the methodology with three primary deployments in the Apache Hadoop ecosystem, namely MapReduce and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively.
As we discuss our approach, we will share insights gathered from the exercise conducted on one of the largest data infrastructures in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs of usage, measure resources consumed, optimize for higher utilization and ROI, and benchmark the cost.
Deep dive into Spark Streaming; topics include:
1. Spark Streaming introduction
2. The computing model in Spark Streaming
3. System model & architecture
4. Fault tolerance, checkpointing
5. Comb on Spark Streaming
Advances in computational technologies in the last decade have created tremendous potential for companies to use analytics to gain a competitive edge. Managers today have a unique opportunity (and daunting challenge) to alter their existing business models to stay competitive. With a plethora of available technologies, managers often find it difficult to understand their options and choose an analytics strategy that will help them achieve their business goals.
In this webinar we will explore:
- How analytically driven companies are leveraging technology to gain competitive advantage
- Five technological innovations of the last decade and how they present both challenges and opportunities for businesses
- Ways that managers can leverage these technological innovations in defining and implementing a successful analytics strategy
Contexti / Oracle - Big Data: From Pilot to Production – Contexti
Big Data is moving from hype to reality for many organisations. The value proposition is clear and sponsorship is high, but how do organisations execute?
Join Oracle and Contexti to discuss the typical journey of a big data project from concept to pilot to production.
• Discuss our experience with a regional Telco
• Common Use Cases across key verticals
• Defining and prioritising use cases
• The challenge of moving from Pilot to Production
• Common Operating Models for Big Data
• Funding a Big Data Capability going forward
• Pilots - common mistakes; challenges; success criteria
Energy companies deal with huge amounts of data, and Apache Spark is an ideal platform for developing machine learning applications for forecasting and pricing. In this talk, we will discuss how Apache Spark's MLlib library can be used to build scalable analytics for clustering, classification and forecasting, primarily for energy applications using electricity and weather datasets. Through a demo, we will illustrate a workflow approach to accomplish an end-to-end pipeline from data pre-processing to deployment for the above use case using PySpark.
At the Hadoop in Taiwan 2013 event, an engineer from TCloud Computing presented the security concepts and features of Hadoop: how to script against the Crypto API, configuration details, and future development.
HDP Advanced Security: Comprehensive Security for Enterprise Hadoop – Hortonworks
With the introduction of YARN, Hadoop has emerged as a first class citizen in the data center as a single Hadoop cluster can now be used to power multiple applications and hold more data. This advance has also put a spotlight on a need for more comprehensive approach to Hadoop security.
Hortonworks recently acquired Hadoop security company XA Secure to provide a common interface for central administration of security policy and coordinated enforcement across authentication, authorization, audit and data protection for the entire Hadoop stack.
In this presentation, Balaji Ganesan and Bosco Durai (previously with XA Secure, now with Hortonworks) introduce HDP Advanced Security, review a comprehensive set of Hadoop security requirements and demonstrate how HDP Advanced Security addresses them.
Abstract –
Spark 2 is here. While Spark has been the leading cluster computation framework for several years, its second version takes Spark to new heights. In this seminar, we will go over Spark internals and learn the new concepts of Spark 2 for building better, scalable big data applications.
Target Audience
Architects, Java/Scala developers, Big Data engineers, team leaders
Prerequisites
Java/Scala knowledge and SQL knowledge
Contents:
- Spark internals
- Architecture
- RDD
- Shuffle explained
- Dataset API
- Spark SQL
- Spark Streaming
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
These slides give an overview of the different parts of Apache Spark.
We analyze the Spark shell in both Scala and Python. Then we cover Spark SQL with an introduction to the DataFrame API. Finally, we describe Spark Streaming and give some code examples.
Topics: spark-shell, pyspark, HDFS, how to copy a file to HDFS, Spark transformations, Spark actions, Spark SQL (Shark),
Spark Streaming, stateless vs. stateful streaming transformations, sliding windows, examples
In this talk, we present two emerging, popular open source projects: Spark and Shark. Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. It can outperform Hadoop by up to 100x in many real-world applications. Spark programs are often much shorter than their MapReduce counterparts thanks to its high-level APIs and language integration in Java, Scala, and Python. Shark is an analytic query engine built on top of Spark that is compatible with Hive. It can run Hive queries much faster in existing Hive warehouses without modifications.
These systems have been adopted by many organizations large and small (e.g. Yahoo, Intel, Adobe, Alibaba, Tencent) to implement data intensive applications such as ETL, interactive SQL, and machine learning.
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ
Introduction to Apache Spark. With an emphasis on the RDD API, Spark SQL (DataFrame and Dataset API) and Spark Streaming.
Presented at the Desert Code Camp:
http://oct2016.desertcodecamp.com/sessions/all
Webinar: Solr & Spark for Real Time Big Data Analytics – Lucidworks
Lucidworks Senior Engineer and Lucene/Solr Committer Tim Potter presents common use cases for integrating Spark and Solr, access to open source code, and performance metrics to help you develop your own large-scale search and discovery solution with Spark and Solr.
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2L4rPmM
This CloudxLab tutorial helps you understand the basics of RDDs in detail. Below are the topics covered in this tutorial:
1) What is RDD - Resilient Distributed Datasets
2) Creating RDD in Scala
3) RDD Operations - Transformations & Actions
4) RDD Transformations - map() & filter()
5) RDD Actions - take() & saveAsTextFile()
6) Lazy Evaluation & Instant Evaluation
7) Lineage Graph
8) flatMap and Union
9) Scala Transformations - Union
10) Scala Actions - saveAsTextFile(), collect(), take() and count()
11) More Actions - reduce()
12) Can We Use reduce() for Computing Average?
13) Solving Problems with Spark
14) Compute Average and Standard Deviation with Spark
15) Pick Random Samples From a Dataset using Spark
Updated version of my talk about Hadoop 3.0 with the newest community updates.
Talk given at the codecentric Meetup Berlin on 31.08.2017 and at the Data2Day Meetup on 28.09.2017 in Heidelberg.
With Hadoop-3.0.0-alpha2 being released in January 2017, it's time to have a closer look at the features and fixes of Hadoop 3.0.
We will have a look at core Hadoop, HDFS and YARN, and answer the emerging question: will Hadoop 3.0 be an architectural revolution like Hadoop 2 was with YARN & Co., or more of an evolution adapting to new use cases like IoT, machine learning and deep learning (TensorFlow)?
This talk gives an introduction to Hadoop 2 and YARN, explains the changes in MapReduce 2, and finally explains and compares Tez and Spark in detail.
The talk was held at the Parallel 2014 conference in Karlsruhe, Germany on 06.05.2014.
Agenda:
- Introduction to Hadoop 2
- MapReduce 2
- Tez, Hive & Stinger Initiative
- Spark
This talk takes you on a rollercoaster ride through Hadoop 2 and explains its most significant changes and components.
The talk was held at the JavaLand conference in Brühl, Germany on 25.03.2014.
Agenda:
- Welcome Office
- YARN Land
- HDFS 2 Land
- YARN App Land
- Enterprise Land
Talk held at a combined meeting of the Web Performance Karlsruhe (http://www.meetup.com/Karlsruhe-Web-Performance-Group/events/153207062) & Big Data Karlsruhe/Stuttgart (http://www.meetup.com/Big-Data-User-Group-Karlsruhe-Stuttgart/events/162836152) user groups.
Agenda:
- Why Hadoop 2?
- HDFS 2
- YARN
- YARN Apps
- Write your own YARN App
- Tez, Hive & Stinger Initiative
MongoDB for Java Programmers (JUGKA, 11.12.13) – Uwe Printz
The talk was given on 11.12.2013 at the Java User Group Karlsruhe and provides an overview of and introduction to MongoDB from a Java programmer's perspective.
The following topics are covered:
- Buzzword bingo: NoSQL, Big Data, horizontal scaling, CAP theorem, eventual consistency
- Overview of MongoDB
- Data manipulation: CRUD, Aggregation Framework, Map/Reduce
- Indexing
- Consistency when writing and reading data
- Java API & frameworks
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition) – Uwe Printz
Talk held at the IT-Stammtisch Darmstadt on 08.11.2013
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
MongoDB for Coder Training (Coding Serbia 2013) – Uwe Printz
Slides of my MongoDB Training given at Coding Serbia Conference on 18.10.2013
Agenda:
1. Introduction to NoSQL & MongoDB
2. Data manipulation: Learn how to CRUD with MongoDB
3. Indexing: Speed up your queries with MongoDB
4. MapReduce: Data aggregation with MongoDB
5. Aggregation Framework: Data aggregation done the MongoDB way
6. Replication: High Availability with MongoDB
7. Sharding: Scaling with MongoDB
The talk was given on 25.09.2013 at the Java User Group Frankfurt and provides an overview of and introduction to MongoDB from a Java programmer's perspective.
The following topics are covered:
- Buzzword bingo: NoSQL, Big Data, horizontal scaling, CAP theorem, eventual consistency
- Overview of MongoDB
- Data manipulation: CRUD, Aggregation Framework, Map/Reduce
- Indexing
- Consistency when writing and reading data
- Java API & frameworks
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...) – Uwe Printz
Talk held at the Java User Group on 05.09.2013 in Novi Sad, Serbia
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
Talk held at the FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany
Agenda:
- Why Twitter Storm?
- What is Twitter Storm?
- What to do with Twitter Storm?
Introduction to the Hadoop Ecosystem (FrOSCon Edition) – Uwe Printz
Talk held at the FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
Map/Confused? A practical approach to Map/Reduce with MongoDB – Uwe Printz
Talk given at MongoDB Munich on 16.10.2012 about the different approaches in MongoDB for using the Map/Reduce algorithm. The talk compares the performance of built-in MongoDB Map/Reduce, group(), aggregate(), find() and the MongoDB-Hadoop Adapter using a practical use case.
6. 2 Spark: In a tweet
"Spark … is what you might call a Swiss Army knife of Big Data analytics tools"
– Reynold Xin (@rxin), Berkeley AMPLab Shark Development Lead
7. 2 Spark: In a nutshell
• Fast and general engine for large-scale data processing
• Advanced DAG execution engine with support for
  – in-memory storage
  – data locality
  – (micro-)batch streaming support
• Improves usability via
  – rich APIs in Scala, Java, Python
  – an interactive shell
• Runs standalone, on YARN, on Mesos, and on Amazon EC2
8. 2 Spark is also…
• Came out of the AMPLab at UC Berkeley in 2009
• A top-level Apache project as of 2014 – http://spark.apache.org
• Backed by a commercial entity: Databricks
• A toolset for data scientists / analysts
• An implementation of the Resilient Distributed Dataset (RDD) concept in Scala
• Hadoop-compatible
9. 2 Spark: Trends
[Google Trends comparison of Apache Drill, Apache Storm, Apache Spark, Apache YARN and Apache Tez; generated using http://www.google.com/trends/]
14. 2 Spark: Core Concept
• Resilient Distributed Dataset (RDD): conceptually, RDDs can be roughly viewed as partitioned, locality-aware distributed vectors
[Diagram: an RDD split into partitions A11, A12, A13]
• Read-only collection of objects spread across a cluster
• Built through parallel transformations and actions
• Computation can be represented by lazily evaluated lineage DAGs composed of connected RDDs
• Automatically rebuilt on failure
• Controllable persistence
15. 2 Spark: RDD Example
Base RDD from HDFS:
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
messages.cache()   // RDD in memory
Iterative processing:
for (str <- Array("foo", "bar"))
  messages.filter(_.contains(str)).count()
16. 2 Spark: Transformations
Transformations – create new datasets from existing ones
map
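The deck shows only map here; as a minimal, hedged sketch (data and variable names are illustrative, assuming a SparkContext sc such as the one provided by the spark-shell), some common transformations look like this:
// Illustrative sketch: transformations are lazy, nothing executes yet
val nums    = sc.parallelize(1 to 10)        // RDD[Int]
val doubled = nums.map(_ * 2)                // transform each element
val evens   = nums.filter(_ % 2 == 0)        // keep only matching elements
val pairs   = nums.map(n => (n % 3, n))      // key-value pairs: RDD[(Int, Int)]
val sums    = pairs.reduceByKey(_ + _)       // combine values per key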
18. 2 Spark: Actions
Actions – return a value to the client after running a computation on the dataset
reduce
19. 2 Spark: Actions
Actions – return a value to the client after running a computation on the dataset
reduce(func)
collect()
count()
first()
countByKey()
foreach(func)
take(n)
takeSample(withReplacement, num, [seed])
takeOrdered(n, [ordering])
saveAsTextFile(path)
saveAsSequenceFile(path) (Java and Scala only)
saveAsObjectFile(path) (Java and Scala only)
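To make the list concrete, here is a small, hedged sketch (illustrative data, assuming a SparkContext sc); each call below triggers an actual job:
// Illustrative sketch: every action launches a computation
val words = sc.parallelize(Seq("spark", "hello", "spark", "end"))
words.count()                      // 4
words.first()                      // "spark"
words.take(2)                      // Array("spark", "hello")
words.map((_, 1)).countByKey()     // Map(spark -> 2, hello -> 1, end -> 1)
words.saveAsTextFile("out/words")  // writes one part file per partition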
20. 2 Spark: Dataflow
All transformations in Spark are lazy and are only computed when an action requires it.
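A minimal sketch of what this means in practice (illustrative path, assuming a SparkContext sc): the two transformations below only record lineage; nothing is read or filtered until count() runs.
val lines  = sc.textFile("hdfs://docs/")          // lazy: nothing is read yet
val errors = lines.filter(_.startsWith("ERROR"))  // lazy: only lineage is recorded
errors.count()                                    // the action triggers the actual work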
21. 2 Spark: Persistence
One of the most important capabilities in Spark is caching a dataset in memory across operations
• cache() – shorthand for persisting with the default MEMORY_ONLY storage level
• persist() – also defaults to MEMORY_ONLY; other storage levels can be passed explicitly
22. 2 Spark: Storage Levels
• persist(StorageLevel)
MEMORY_ONLY – Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK – Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_ONLY_SER – Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER – Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY – Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, … – Same as the levels above, but replicate each partition on two cluster nodes.
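As a hedged example of picking a non-default level (illustrative data; StorageLevel is part of the public Spark API):
import org.apache.spark.storage.StorageLevel

val messages = sc.textFile("hdfs://docs/").map(_.toLowerCase)
messages.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized, spills to disk
messages.count()      // first action materializes the cache
messages.unpersist()  // release the cached partitions when done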
23. 2 Spark: Parallelism
Can be specified in a number of different ways
• RDD partition number
  – sc.textFile(input, minSplits = 10)
  – sc.parallelize(1 to 10000, numSlices = 10)
• Mapper-side parallelism
  – Usually inherited from parent RDD(s)
• Reducer-side parallelism
  – rdd.reduceByKey(_ + _, numPartitions = 10)
  – rdd.reduceByKey(partitioner = p, _ + _)
• "Zoom in/out"
  – rdd.repartition(numPartitions: Int)
  – rdd.coalesce(numPartitions: Int, shuffle: Boolean)
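A short sketch tying these options together (illustrative numbers, assuming a SparkContext sc):
val rdd   = sc.parallelize(1 to 100000, numSlices = 4)                 // 4 partitions
val sums  = rdd.map(n => (n % 10, n)).reduceByKey(_ + _, numPartitions = 10)
val wider = sums.repartition(20)                                       // full shuffle to 20
val fewer = wider.coalesce(5, shuffle = false)                         // merge without a shuffle
println(fewer.partitions.length)                                       // 5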
24. 2 Spark: Example
Text processing example: top words by frequency
25. 2 Spark: Frequency Example
Create RDD from external data
[Diagram: data sources supported via Hadoop I/O – HDFS, S3, HBase, Cassandra, ElasticSearch, MongoDB, …; I/O via Hadoop is optional]
// Step 1. Create RDD from Hadoop text files
val docs = spark.textFile("hdfs://docs/")
26. 2 Spark: Frequency Example
Function map
[Diagram: RDD[String] with lines "Hello World / This is / Spark / Spark / The end" mapped to the same lines in lower case – RDD[String] to RDD[String]]
.map(line => line.toLowerCase)
27. 2 Spark: Frequency Example
Function map
[Diagram: same lower-casing example as the previous slide]
.map(line => line.toLowerCase)
is equivalent to
.map(_.toLowerCase)
28. 2 Spark: Frequency Example
Function map
// Step 2. Convert lines to lower case
val lower = docs.map(line => line.toLowerCase)
[Diagram: lines lower-cased; .map(line => line.toLowerCase) is equivalent to .map(_.toLowerCase)]
29. 2 Spark: Frequency Example
map vs. flatMap
[Diagram: RDD[String] of lines ("hello world", "this is spark", "spark", "the end"); .map(_.split("\\s+")) yields RDD[Array[String]] – one array of words per line]
30. 2 Spark: Frequency Example
map vs. flatMap
[Diagram: RDD[String] of lines; .map(_.split("\\s+")) yields RDD[Array[String]]; flattening those arrays yields RDD[String] of individual words]
31. 2 Spark: Frequency Example
map vs. flatMap
[Diagram: .map(_.split("\\s+")) followed by flattening is equivalent to]
.flatMap(line => line.split("\\s+"))
32. 2 Spark: Frequency Example
map vs. flatMap
[Diagram: map + flatten vs. flatMap, as on the previous slides]
// Step 3. Split lines into words
val words = lower.flatMap(line => line.split("\\s+"))
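A compact, hedged sketch of the difference shown on these slides (illustrative data):
val lines  = sc.parallelize(Seq("hello world", "this is spark"))
val arrays = lines.map(_.split("\\s+"))      // RDD[Array[String]]: one array per line
val words  = lines.flatMap(_.split("\\s+"))  // RDD[String]: one element per word
arrays.count()  // 2
words.count()   // 5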
33. 2 Spark: Frequency Example
Key-Value Pairs
[Diagram: RDD[String] of words ("hello", "world", "spark", "spark", "end") mapped to (word, 1) tuples – RDD[(String, Int)]]
.map(word => Tuple2(word, 1))
34. 2 Spark: Frequency Example
Key-Value Pairs
.map(word => Tuple2(word, 1))
is equivalent to
.map(word => (word, 1))
[Diagram: words mapped to (word, 1) pairs, as on the previous slide]
35. 2 Spark: Frequency Example
Key-Value Pairs
// Step 4. Convert into tuples
val counts = words.map(word => (word, 1))
[Diagram: words mapped to (word, 1) pairs, as on the previous slides]
36. 2 Spark: Frequency Example
Shuffling
[Diagram: RDD[(String, Int)] of (word, 1) pairs; .groupByKey yields RDD[(String, Iterator[Int])] – end -> (1), hello -> (1), spark -> (1, 1), world -> (1)]
37. 2 Spark: Frequency Example
Shuffling
[Diagram: .groupByKey yields RDD[(String, Iterator[Int])]; .mapValues(_.reduce((a, b) => a + b)) then yields RDD[(String, Int)] – end -> 1, hello -> 1, spark -> 2, world -> 1]
38. 2 Spark: Frequency Example
Shuffling
[Diagram: .groupByKey followed by .mapValues(_.reduce((a, b) => a + b)) is equivalent to]
.reduceByKey((a, b) => a + b)
39. 2 Spark: Frequency Example
Shuffling
// Step 5. Count all words
val freq = counts.reduceByKey(_ + _)
[Diagram: (word, 1) pairs reduced by key to word frequencies – end -> 1, hello -> 1, spark -> 2, world -> 1]
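The two formulations on these slides produce the same result, but reduceByKey is usually preferred because it pre-aggregates within each partition before the shuffle, moving less data across the network. A minimal sketch (illustrative data):
val counts    = sc.parallelize(Seq(("spark", 1), ("hello", 1), ("spark", 1)))
val viaGroup  = counts.groupByKey().mapValues(_.reduce(_ + _))  // shuffles every pair
val viaReduce = counts.reduceByKey(_ + _)                       // combines map-side first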
40. 2 Spark: Frequency Example
Top N (prepare data)
// Step 6. Swap tuples (partial code)
freq.map(_.swap)
[Diagram: RDD[(String, Int)] – (end, 1), (hello, 1), (spark, 2), (world, 1) – swapped via .map(_.swap) into RDD[(Int, String)]]
41. 2 Spark: Frequency Example
Top N (first attempt)
.sortByKey
[Diagram: RDD[(Int, String)] sorted by key – (2, spark), (1, end), (1, hello), (1, world)]
42. 2 Spark: Frequency Example
Top N
.top(N)
[Diagram: each partition of the RDD[(Int, String)] computes its local top N]
43. 2 Spark: Frequency Example
Top N
.top(N)
[Diagram: the local top-N results of all partitions are reduced into a final Array[(Int, String)] – (2, spark), (1, end)]
44. 2 Spark: Frequency Example
Top N
// Step 6. Swap tuples (complete code)
val top = freq.map(_.swap).top(N)
[Diagram: local top N per partition, then reduction to the final Array[(Int, String)]]
45. 2 Spark: Frequency Example
val spark = new SparkContext()
// Create RDD from Hadoop text file
val docs = spark.textFile("hdfs://docs/")
// Split lines into words and process
val lower = docs.map(line => line.toLowerCase)
val words = lower.flatMap(line => line.split("\\s+"))
val counts = words.map(word => (word, 1))
// Count all words
val freq = counts.reduceByKey(_ + _)
// Swap tuples and get top results
val top = freq.map(_.swap).top(N)
top.foreach(println)
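The listing assumes that N and the SparkContext already exist; a self-contained variant that could be pasted into the spark-shell (where sc is predefined) might look like this (a sketch, not the deck's exact code):
val N = 10  // assumed value, not given on the slide
val top = sc.textFile("hdfs://docs/")
  .map(_.toLowerCase)
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _)
  .map(_.swap)
  .top(N)
top.foreach(println)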
50. 2 Spark: SQL
• Spark SQL allows relational queries expressed in SQL, HiveQL or Scala
• Uses SchemaRDDs composed of Row objects (comparable to a table in a traditional RDBMS)
• A SchemaRDD can be created from
  – an existing RDD
  – a Parquet file
  – a JSON dataset
  – by running HiveQL against data stored in Apache Hive
• Supports a domain-specific language for writing queries
51. 2 Spark: SQL
registerFunction("LEN", (_: String).length)
val queryRdd = sql("""
  SELECT * FROM counts
  WHERE LEN(word) >= 10
  ORDER BY total DESC
  LIMIT 10
""")
queryRdd
  .map(c => s"word: ${c(0)} \t| total: ${c(1)}")
  .collect()
  .foreach(println)
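The query above assumes a counts table is already registered. A hedged sketch of how that setup might look with the Spark 1.x API of the time (the case class and table name are assumptions, reusing freq from the word-count example):
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD               // implicit RDD -> SchemaRDD conversion

case class WordCount(word: String, total: Int)  // assumed schema
val countsRdd = freq.map { case (w, t) => WordCount(w, t) }
countsRdd.registerTempTable("counts")           // queryable as "counts" from sql(...)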
52. 2 Spark: GraphX
• GraphX is the Spark API for graphs and graph-parallel computation
• APIs to join and traverse graphs
• Optimally partitions and indexes vertices & edges (represented as RDDs)
• Supports PageRank, connected components, triangle counting, …
53. 2 Spark: GraphX
val graph = Graph(userIdRDD, assocRDD)
val ranks = graph.pageRank(0.0001).vertices
val userRDD = sc.textFile("graphx/data/users.txt")
val users = userRDD.map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
54. 2 Spark: MLlib
• Machine learning library similar to Apache Mahout
• Supports statistics, regression, decision trees, clustering, PCA, gradient descent, …
• Iterative algorithms much faster due to in-memory processing
55. 2 Spark: MLlib
val data = sc.textFile("data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(
    parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
val model = LinearRegressionWithSGD.train(parsedData, 100)
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds
  .map { case (v, p) => math.pow(v - p, 2) }.mean()
57. 2 Use Case: Yahoo Native Ads
Logistic regression algorithm
• 120 LOC in Spark/Scala
• 30 min. for model creation on 100M samples and 13K features
Initial version launched within 2 hours of the Spark-on-YARN announcement
• Compare: several days for hardware acquisition, system setup and data movement
http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
58. 2 Use Case: Yahoo Mobile Ads
Learning from mobile search ad click data
• 600M labeled examples on HDFS
• 100M sparse features
Spark programs for Gradient Boosting Decision Trees
• 6 hours for model training with 100 workers
• Model accuracy very close to heavily manually tuned logistic regression models
http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
62. 2 Spark: Future work
• Spark Core
  – Focus on maturity, optimization & pluggability
  – Enable long-running services (Slider)
  – Give resources back to the cluster when idle
  – Integrate with Hadoop enhancements (Timeline Server, ORC file format)
• Spark ecosystem
  – Focus on adding capabilities
63. 2 One more thing…
Let's get started with Spark!