In this one-day workshop, we will introduce Spark at a high level. Spark is fundamentally different from writing MapReduce jobs, so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses, including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
Apache Spark is rapidly emerging as the prime platform for advanced analytics in Hadoop. This briefing is updated to reflect news and announcements as of July 2014.
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop (MapR Technologies)
http://bit.ly/1BTaXZP – Apache Spark is currently one of the most active projects in the Hadoop ecosystem, and as such, there's been plenty of hype about it in recent months. But how much of the discussion is marketing spin, and what are the facts? MapR and Databricks, the company that created and led the development of the Spark stack, cut through the noise to uncover practical advantages of having the full set of Spark technologies at your disposal, and reveal the benefits of running Spark on Hadoop.
This presentation was given at a webinar hosted by Data Science Central and co-presented by MapR + Databricks.
To see the webinar, please go to: http://www.datasciencecentral.com/video/let-spark-fly-advantages-and-use-cases-for-spark-on-hadoop
A comprehensive overview of the entire set of Hadoop operations and tools: cluster management, coordination, ingestion, streaming, formats, storage, resources, processing, workflow, analysis, search, and visualization.
Have you been in the situation where you're about to start a new project and ask yourself, what's the right tool for the job here? I've been in that situation many times and thought it might be useful to share a recent project we did and why we selected Spark, Python, and Parquet. My plan is to take you through a use case that involves loading, transforming, aggregating, and persisting the dataset. We'll use an open dataset consisting of full fund holdings graciously provided by Morningstar. My goals in presenting this use case are to show the audience how these technologies can be applied to a real-world problem and to inspire members of the audience to start learning these technologies and applying them to their own projects.
"Big Data" is a much-hyped term nowadays in Business Computing. However, the core concept of collaborative environments conducting experiments over large shared data repositories has existed for decades. In this talk, I will outline how recent advances in Cloud Computing, Big Data processing frameworks, and agile application development platforms enable Data Intensive Cloud Applications. I will provide a brief history of efforts in building scalable & adaptive run-time environments, and the role these runtime systems will play in new Cloud Applications. I will present a vision for cloud platforms for science, where data-intensive frameworks such as Apache Hadoop will play a key role.
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming (Paco Nathan)
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro-batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
This talk will compare case studies of production deployments of Spark Streaming, survey emerging design patterns for integration with popular complementary OSS frameworks, cover some of the more advanced features such as approximation algorithms, and take a look at what's ahead, including the new Python support for Spark Streaming in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster (Milind Bhandarkar)
The refactoring of the Hadoop MapReduce framework, separating resource management (YARN) from job execution (MapReduce), has allowed multiple programming paradigms to take advantage of massive-scale Hadoop Distributed File System (HDFS) clusters. Hamster (Hadoop And Mpi on the same cluSTER) is a port of OpenMPI that uses YARN as a resource manager. Hamster allows applications written using MPI (Message Passing Interface) to run alongside other YARN applications and frameworks, such as MapReduce, on the same Hadoop cluster. In this talk, I will describe the architecture of Hamster and present a few MPI applications that have been demonstrated to run in Hadoop. GraphLab uses MPI as one of its supported communication libraries and can read/write data from/to HDFS. I will describe how GraphLab runs on top of Hadoop using Hamster, and present a few benchmarks in graph analytics comparing GraphLab with other machine learning frameworks.
Apache Spark: The Next Gen toolset for Big Data Processing (prajods)
The Spark project from Apache (spark.apache.org) is the next generation of Big Data processing systems. It uses a new architecture and in-memory processing to achieve orders-of-magnitude improvements in performance. Some would call it the successor to the Hadoop set of tools. Hadoop is a batch-mode Big Data processor that depends on disk-based files. Spark improves on this and supports real-time and interactive processing, in addition to batch processing.
Table of contents:
1. The Big Data triangle
2. Hadoop stack and its limitations
3. Spark: An Overview
3.a. Spark Streaming
3.b. GraphX: Graph processing
3.c. MLlib: Machine Learning
4. Performance characteristics of Spark
Integrating Existing C++ Libraries into PySpark with Esther Kundin (Databricks)
Bloomberg’s Machine Learning/Text Analysis team has developed many machine learning libraries for fast real-time sentiment analysis of incoming news stories. These models were developed using smaller training sets, implemented in C++ for minimal latency, and are currently running in production. To facilitate backtesting our production models across our full data set, we needed to be able to parallelize our workloads, while using the actual production code.
We also wanted to integrate the C++ code with PySpark and use it to run our models. In this talk, I will discuss some of the challenges we faced, decisions we made, and other options when dealing with integrating existing C++ code into a Spark system. The techniques we developed have been used successfully by our team multiple times and I am sure others will benefit from the gotchas that we were able to identify.
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi... (Databricks)
While systems like Apache Spark have moved beyond a simple map-reduce model, many data scientists and scientific users still struggle with complex cluster management and configuration tools when trying to do data processing in the cloud. Recently, cloud providers have offered infrastructure such as AWS Lambda to run event-driven, stateless functions as micro-services. In this model, a function is deployed once and is invoked repeatedly whenever new inputs arrive and elastically scales with input size. In this session, the speakers claim that microservices on serverless infrastructure present a viable platform for eliminating cluster management overhead and fulfilling the promise of elasticity in cloud computing for all users. Their key insight is that they can dynamically inject code into these stateless functions and, combined with remote storage, they can build a data processing system that inherits the elasticity of the serverless model while addressing the simplicity required by end users.
Using PyWren, their implementation on AWS Lambda, they show that this model is general enough to implement a number of distributed computing models, such as BSP, efficiently. Learn about a number of scientific and machine learning applications that they have built with PyWren, and how this model could be used to develop a serverless-Spark in the future.
Reference architecture for Internet of Things (Sujee Maniyam)
What kind of data infrastructure is needed to support the Internet of Things?
This talk presents a reference architecture.
We are actually building this architecture as an open source project. See here: bit.ly/iotxyz
This slide deck introduces Spark on Hadoop. It is meant just to help you form an idea of Spark's architecture, data flow, job scheduling, and programming model. Not all technical details are included.
This slide deck is used as an introduction to the internals of Apache Spark, as part of the Distributed Systems and Cloud Computing course I teach at Eurecom.
Course website:
http://michiard.github.io/DISC-CLOUD-COURSE/
Sources available here:
https://github.com/michiard/DISC-CLOUD-COURSE
Introduction to Apache Spark Developer Training (Cloudera, Inc.)
Apache Spark is a next-generation processing engine optimized for speed, ease of use, and advanced analytics well beyond batch. The Spark framework supports streaming data and complex, iterative algorithms, enabling applications to run 100x faster than traditional MapReduce programs. With Spark, developers can write sophisticated parallel applications for faster business decisions and better user outcomes, applied to a wide variety of architectures and industries.
Learn what Apache Spark is and how it compares to Hadoop MapReduce; how to filter, map, reduce, and save Resilient Distributed Datasets (RDDs); who is best suited to attend the course and what prior knowledge you should have; and the benefits of building Spark applications as part of an enterprise data hub.
Apache Mesos delivers resource management across the entire data center. This allows a company's operations team to tune the performance of the entire software stack by shifting resources between applications without having to re-engineer software. Apache Hadoop and Apache Spark together deliver all of the processing power for handling big data. Custom enterprise applications can leverage Hadoop and Spark to deliver the enterprise functionality, while Mesos can balance the resources across the data center. This presentation will focus on an end-to-end use case for the architecture and benefits that can be delivered with this software stack. These benefits will include operational efficiencies, better CPU utilization, and simplified software architectures.
At the Hadoop in Taiwan 2013 event, an engineer from TCloud Computing presented the security concepts and features of Hadoop, how to script the Crypto API, configuration details, and future development.
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...Dataconomy Media
"Spark, DeepLearning and Life Sciences, Systems Biology in the Big Data age" Dev Lakhani, Founder of Batch Insights
YouTube Link: https://www.youtube.com/watch?v=z6aTv0ZKndQ
Watch more from Data Natives 2015 here: http://bit.ly/1OVkK2J
Visit the conference website to learn more: www.datanatives.io
Follow Data Natives:
https://www.facebook.com/DataNatives
https://twitter.com/DataNativesConf
Stay Connected to Data Natives by Email: Subscribe to our newsletter to get the news first about Data Natives 2016: http://bit.ly/1WMJAqS
About the author:
Dev Lakhani has a background in Software Engineering and Computational Statistics and is a founder of Batch Insights, a Big Data consultancy that has worked on numerous Big Data architectures and data science projects in Tier 1 banking, global telecoms, retail, media, and fashion. Dev has been actively working with the Hadoop infrastructure since its inception and is currently researching and contributing to the Apache Spark and Tachyon communities.
HDP Advanced Security: Comprehensive Security for Enterprise Hadoop (Hortonworks)
With the introduction of YARN, Hadoop has emerged as a first-class citizen in the data center, as a single Hadoop cluster can now be used to power multiple applications and hold more data. This advance has also put a spotlight on the need for a more comprehensive approach to Hadoop security.
Hortonworks recently acquired Hadoop security company XA Secure to provide a common interface for central administration of security policy and coordinated enforcement across authentication, authorization, audit and data protection for the entire Hadoop stack.
In this presentation, Balaji Ganesan and Bosco Durai (previously with XA Secure, now with Hortonworks) introduce HDP Advanced Security, review a comprehensive set of Hadoop security requirements and demonstrate how HDP Advanced Security addresses them.
Unikernels: in search of a killer app and a killer ecosystem (rhatr)
By now, unikernels are not a new kid on the block anymore.
There's a healthy diversity of implementations and communities, to the point where a project like UniK had to be created to curate it all. This talk will attempt to answer the question of what may be the missing piece to make unikernels as ubiquitous as virtualization or public clouds are today. We will mainly focus on the example of OSv (a popular almost-POSIX unikernel) and its evolution in search of a killer app to run. In addition, we will attempt to present a vision of a utopian overall ecosystem where unikernels can take their rightful place.
Insight on "From Hadoop to Spark" by Mark KerznerSynerzip
In this talk, the presenter walks you through a case study of moving from Hadoop to Spark. We compare Hadoop and Spark side by side, highlight their strengths and weaknesses, and present a balanced assessment of which platform might be better for specific needs.
Read more at https://www.synerzip.com/webinar/from-hadoop-to-spark-webinar-august-19-2015/
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More (Paco Nathan)
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
This introductory-level talk is about Apache Flink: a multi-purpose Big Data analytics framework leading a movement towards the unification of batch and stream processing in open source.
With the many technical innovations it brings, along with its unique vision and philosophy, it is considered the 4G (4th generation) of Big Data analytics frameworks, providing the only hybrid (real-time streaming + batch) open source distributed data processing engine supporting many use cases: batch, streaming, relational queries, machine learning, and graph processing.
In this talk, you will learn about:
1. What is the Apache Flink stack and how does it fit into the Big Data ecosystem?
2. How does Apache Flink integrate with Hadoop and other open source tools for data input and output, as well as deployment?
3. Why is Apache Flink an alternative to Apache Hadoop MapReduce, Apache Storm, and Apache Spark?
4. Who is using Apache Flink?
5. Where can you learn more about Apache Flink?
Unified Big Data Processing with Apache Spark (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
YARN webinar series: Using Scalding to write applications to Hadoop and YARN (Hortonworks)
This webinar focuses on introducing Scalding for developers and writing applications for Hadoop and YARN using Scalding. Guest speaker Jonathan Coveney from Twitter provides an overview, use cases, limitations, and core concepts.
Transitioning Compute Models: Hadoop MapReduce to Spark (Slim Baltagi)
This presentation is an analysis of the observed trends in the transition from the Hadoop ecosystem to the Spark ecosystem. The related talk took place at the Chicago Hadoop User Group (CHUG) meetup held on February 12, 2015.
http://bit.ly/1BTaXZP – As organizations look for even faster ways to derive value from big data, they are turning to Apache Spark, an in-memory processing framework that offers lightning-fast big data analytics, providing speed, developer productivity, and real-time processing advantages. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative in-memory and streaming analysis. This talk will give an introduction to the Spark stack, explain how Spark achieves lightning-fast results, and show how it complements Apache Hadoop. By the end of the session, you'll come away with a deeper understanding of how you can unlock deeper insights from your data, faster, with Spark.
Why Apache Flink is the 4G of Big Data Analytics Frameworks (Slim Baltagi)
Apache Flink is a community-driven open source and memory-centric Big Data analytics framework. It provides the only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine supporting many use cases.
Flink uses a mixture of Scala and Java internally, has very good Scala APIs and some of its libraries are basically pure Scala (FlinkML and Table).
At its core, it is a streaming dataflow execution engine and it also provides several APIs for batch processing (DataSet API), real-time streaming (DataStream API) and relational queries (Table API) and also domain-specific libraries for machine learning (FlinkML) and graph processing (Gelly).
In this talk, you will learn in more detail about:
What Apache Flink is, how it fits into the Big Data ecosystem, and why it is the 4G (4th generation) of Big Data analytics frameworks.
How Apache Flink integrates with Apache Hadoop and other open source tools for data input and output, as well as deployment.
Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm, and Apache Spark, and what the benchmarking results are between Apache Flink and those other Big Data analytics frameworks.
Workshop on Parallel, Cluster and Cloud Computing on Multi-core & GPU (PCCCMG-2015)
Workshop conducted by the Computer Society of India, in association with the Dept of CSE, VNIT, and Persistence System Ltd, Nagpur.
Workshop dates: 4th to 6th September 2015
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms (DataStax Academy)
Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.
This is a course in development. Here is a webinar about it: https://www.youtube.com/watch?v=7vsoZLOtSdY&t=773s.
Our next step is to prepare a "Teacher's Companion" set of slides so that anybody could teach it, to any audience.
Here are some tips on hiring and retaining top Big Data talent, covering how to source candidates, how to interview them, and interview techniques and mistakes.
Listen to the video of the presentation and download the slides here: http://elephantscale.com/2017/03/building-successful-big-data-team-demand-webinar/
Petrophysics and Big Data by Elephant Scale training and consulting (elephantscale)
Presented at the annual petrophysics software (SPWLA) show in Houston, TX, by Mark Kerzner. How Oil & Gas should approach Big Data, and how Elephant Scale can help in training and implementation.
1. From Hadoop to Spark
Introduction
Hadoop and Spark Comparison
From Hadoop to Spark
2. Hi, I'm Sujee Maniyam
• Founder / Principal @ ElephantScale
• Consulting & Training in Big Data
• Spark / Hadoop / NoSQL / Data Science
• Author
– "Hadoop Illuminated" (open source book)
– "HBase Design Patterns"
• Open source contributor: github.com/sujee
• sujee@elephantscale.com
• www.ElephantScale.com
Spark training available!
(c) ElephantScale.com 2015 2
3. Webinar Audience
• I am already using Hadoop. Should I go to Spark?
• I am thinking about Hadoop. Should I skip Hadoop and go straight to Spark?
(c) ElephantScale.com 2015 3
4. Webinar Outline
• Intro: what is Hadoop and what is Spark?
• Capabilities and advantages of Spark & Hadoop
• Best use cases for Spark / Hadoop
• From Hadoop to Spark – how to?
Webinar: From Hadoop to Spark (c) ElephantScale.com 2015 4
6. Hadoop in 20 Seconds
• 'The original' Big Data platform
• Very well field-tested
• Scales to petabytes of data
• Enables analytics at massive scale
(c) ElephantScale.com 2015 6
8. Hadoop Ecosystem – by function
• HDFS – provides distributed storage
• MapReduce – provides distributed computing
• Pig – high-level MapReduce
• Hive – SQL layer over Hadoop
• HBase – NoSQL storage for real-time queries
(c) ElephantScale.com 2015 8
10. Hadoop : Use Cases
• Two modes: batch & real time
• Batch use case
– Analytics at large scale (terabyte to petabyte scale)
– Analysis times can be minutes / hours, depending on
• the size of the data being analyzed
• and the type of query
– Examples:
• Large ETL workloads
• "Analyze clickstream data and calculate top page visits"
• "Combine purchase data and click data and figure out discounts to apply"
(c) ElephantScale.com 2015 10
11. Hadoop Use Cases
• Real-time use cases do not rely on MapReduce
• Instead we use HBase
– A real-time NoSQL datastore built on Hadoop
• Example: tracking sensor data
– Store data from millions of sensors
– Could be billions of data points
– "Find the latest reading from a sensor"
– This query must be done in real time (in milliseconds)
• "Needle in a haystack" scenarios
– We look for one / a few records within billions
(c) ElephantScale.com 2015 11
14. Big Data Analytics Evolution (v1)
• Decision times: batch (hours / days)
• Use cases:
– Modeling
– ETL
– Reporting
(c) ElephantScale.com 2015 14
15. Moving Towards Fast Data (v2)
• Decision time: (near) real time – seconds (or milliseconds)
• Use cases:
– Alerts (medical / security)
– Fraud detection
(c) ElephantScale.com 2015 15
16. Current Big Data Processing Challenges
• Processing needs are outpacing 1st-generation tools
• Beyond batch
– Not everyone has terabytes of data to process
– Small-to-medium data sets (a few hundred gigs) are more prevalent
– Data may not be on disk: in memory, or coming via streaming channels
• MapReduce (MR)'s limitations
– Batch processing doesn't fit all needs
– Not effective for 'iterative programming' (machine learning algorithms, etc.)
– High latency for streaming needs
• Spark is a 2nd-generation tool addressing these needs
(c) ElephantScale.com 2015 16
17. What is Spark?
• Open source cluster computing engine
– Very fast: in-memory ops 100x faster than MR; on-disk ops 10x faster than MR
– General purpose: MR-style batch, SQL, streaming, machine learning, analytics
– Compatible: runs over Hadoop, Mesos, YARN, or standalone; works with HDFS, S3, Cassandra, HBase, …
– Easier to code: word count in 2 lines
• Spark's roots:
– Came out of the Berkeley AMP Lab
– Now a top-level Apache project
– Version 1.5 released in Sept 2015
"First Big Data platform to integrate batch, streaming and interactive computations in a unified framework" – stratio.com
(c) ElephantScale.com 2015 17
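To ground the "word count in 2 lines" claim, here is the classic word count in the Spark Scala shell, split across a few lines for readability (the input and output paths are placeholders):

// read a text file, split into words, and count each word
val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///data/wordcounts") // write (word, count) pairs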
18. Spark Illustrated
[Stack diagram:]
• Libraries on top of Spark Core: Spark SQL (schema / SQL), Spark Streaming (real time), MLlib (machine learning), GraphX (graph processing)
• Cluster managers: Standalone, YARN, Mesos
• Data storage: HDFS, S3, Cassandra, ???
(c) ElephantScale.com 2015 18
19. Spark Core
• Basic building blocks for a distributed compute engine
– Task scheduling and memory management
– Fault recovery (recovers missing pieces on node failure)
– Storage system interfaces
• Defines the Spark API and data model
• Data model: RDD (Resilient Distributed Dataset)
– A distributed collection of items
– Can be worked on in parallel
– Easily created from many data sources (any HDFS InputSource)
• Spark API: Scala, Python, and Java
– Compact API for working with RDDs and interacting with Spark
– Much easier to use than the MapReduce API
Session 2: Introduction to Spark (c) ElephantScale.com 2015 19
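To make the RDD model concrete, here is a minimal sketch in the Scala shell (the numbers are made up for illustration; sc is the shell's SparkContext):

val nums = sc.parallelize(1 to 1000)   // distribute a local collection as an RDD
val squares = nums.map(n => n * n)     // transformation: lazy, nothing runs yet
val evens = squares.filter(_ % 2 == 0) // another lazy transformation
println(evens.count())                 // action: triggers the parallel computation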
20. Spark Components
• Spark SQL: structured data
– Supports SQL and HQL (Hive Query Language)
– Data sources include Hive tables, JSON, CSV, Parquet
• Spark Streaming: live streams of data in real time
– Low latency, high throughput (1000s of events / sec)
– Log files, stock ticks, sensor data / IoT (Internet of Things), …
• MLlib: machine learning at scale
– Classification / regression, collaborative filtering, …
– Model evaluation and data import
• GraphX: graph manipulation, graph-parallel computation
– Social network friendships, link data, …
– Graph manipulation, operations, and common algorithms
Session 2: Introduction to Spark (c) ElephantScale.com 2015 20
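As a sketch of the Spark Streaming piece, a word count over 1-second micro-batches using the Spark 1.x DStream API (the socket host and port are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))      // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999) // placeholder text source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()         // print a few results from each batch
ssc.start()            // start receiving and processing
ssc.awaitTermination()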
21. Spark : 'Unified' Stack
• Spark components support multiple programming models
– MapReduce-style batch processing
– Streaming / real-time processing
– Querying via SQL
– Machine learning
• All modules are tightly integrated
– Facilitates rich applications
• Spark can be the only stack you need!
– No need to run multiple clusters (Hadoop cluster, Storm cluster, etc.)
Session 2: Introduction to Spark (c) ElephantScale.com 2015 21
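One way to picture the "same business logic across models" point: a function written once against RDDs can run over a static file in batch and be reused on every micro-batch of a DStream (a sketch; wordCounts is a hypothetical helper, and lines is a DStream like the one in the previous sketch):

import org.apache.spark.rdd.RDD

// hypothetical business logic, written once against RDDs
def wordCounts(rdd: RDD[String]): RDD[(String, Int)] =
  rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

val batchResult = wordCounts(sc.textFile("hdfs:///logs/day1")) // batch use
val streamResult = lines.transform(rdd => wordCounts(rdd))     // streaming reuse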
29. Comparison With Hadoop
• Hadoop: distributed storage + distributed compute. Spark: distributed compute only.
• Hadoop: MapReduce framework. Spark: generalized computation.
• Hadoop: usually data on disk (HDFS). Spark: on disk / in memory.
• Hadoop: not ideal for iterative work. Spark: great at iterative workloads (machine learning, etc.).
• Hadoop: batch processing. Spark: up to 10x faster for data on disk, up to 100x faster for data in memory.
• Hadoop: mostly Java. Spark: compact code; Java, Python, Scala supported.
• Hadoop: no unified shell. Spark: shell for ad-hoc exploration.
(c) ElephantScale.com 2015 29
30. Spark Is a Better Fit for Iterative Workloads
(c) ElephantScale.com 2015 30
32. Is Spark Replacing Hadoop?
• Spark runs on Hadoop / YARN
– Can access data in HDFS
– Uses YARN for clustering
• The Spark programming model is more flexible than MapReduce
• Spark is really great if the data fits in memory (a few hundred gigs)
• Spark is 'storage agnostic' (see next slide)
(c) ElephantScale.com 2015 32
37. Going from Hadoop to Spark
Introduction
Hadoop and Spark Comparison
Going from Hadoop to Spark
Session 2: Introduction to Spark
38. Why Move From Hadoop to Spark?
• Spark is 'easier' than Hadoop
• 'Friendlier' for data scientists / analysts
– Interactive shell: fast development cycles, ad-hoc exploration
• API supports multiple languages: Java, Scala, Python
• Great for small (gigs) to medium (100s of gigs) data
(c) ElephantScale.com 2015 38
39. Spark : 'Unified' Stack
• Spark supports multiple programming models
– MapReduce-style batch processing
– Streaming / real-time processing
– Querying via SQL
– Machine learning
• All modules are tightly integrated
– Facilitates rich applications
• Spark can be the only stack you need!
– No need to run multiple clusters (Hadoop cluster, Storm cluster, etc.)
Image: buymeposters.com
(c) ElephantScale.com 2015 39
40. Migrating From Hadoop → Spark
• Distributed storage – Hadoop: HDFS, cloud storage (Amazon S3). Spark: HDFS, cloud storage (Amazon S3), distributed file systems (NFS / Ceph), distributed NoSQL (Cassandra), Tachyon (in memory).
• SQL querying – Hadoop: Hive. Spark: Spark SQL (DataFrames).
• ETL workflow – Hadoop: Pig. Spark: Spork (Pig on Spark), or a mix of Spark SQL + RDD programming.
• Machine learning – Hadoop: Mahout. Spark: MLlib.
• NoSQL DB – Hadoop: HBase. Spark: ???
(c) ElephantScale.com 2015 40
41. Things to Consider When Moving From Hadoop to Spark
1. Data size
2. File system
3. Analytics
A. SQL
B. ETL
C. Machine learning
(c) ElephantScale.com 2015 41
42. Data Size : “You Don’t Have Big Data”
(c) ElephantScale.com 2015 42
43. Data Size (T-shirt sizing)
[Chart mapping data-size tiers (< a few GB, 10 GB+, 100 GB+, 1 TB+, 100 TB+, PB+) to Hadoop / Spark choices. Image credit: blog.trumpi.co.za]
(c) ElephantScale.com 2015 43
44. Data Size
• Lots of Spark adoption at SMALL – MEDIUM scale
– Good fit
– Data might fit in memory!
• Applications
– Iterative workloads (machine learning, etc.)
– Streaming
(c) ElephantScale.com 2015 44
46. Decision : File System
"What kind of file system do I need for Spark?"
(c) ElephantScale.com 2015 46
47. File System
• Hadoop = storage + compute
• Spark = compute only
• Spark needs a distributed FS
• File system choices for Spark
– HDFS: Hadoop Distributed File System
• Reliable
• Good performance (data locality)
• Field-tested for PBs of data
– S3: Amazon
• Reliable cloud storage
• Huge scale
– NFS: Network File System (a shared FS across machines)
– Tachyon (in memory; experimental)
(c) ElephantScale.com 2015 47
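Pointing Spark at these back ends is mostly a matter of the URI scheme (a sketch; host, bucket, and path names are placeholders, and S3 credentials are assumed to be configured):

val fromHdfs = sc.textFile("hdfs://namenode:8020/data/events")
val fromS3 = sc.textFile("s3n://my-bucket/data/events")     // s3n was the usual scheme in the Spark 1.x era
val fromNfs = sc.textFile("file:///mnt/shared/data/events") // NFS mounted at the same path on every worker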
49. File Systems For Spark
• Data locality – HDFS: high (best). NFS: local enough (ok). Amazon S3: none.
• Throughput – HDFS: high (best). NFS: medium (good). Amazon S3: low (ok).
• Latency – HDFS: low (best). NFS: low. Amazon S3: high.
• Reliability – HDFS: very high (replicated). NFS: low. Amazon S3: very high.
• Cost – HDFS: varies. NFS: varies. Amazon S3: $30 / TB / month.
(c) ElephantScale.com 2015 49
50. File Systems Throughput Comparison
• Data: 10 GB+ (11.3 GB)
• Each file: ~1+ GB (x 10)
• 400 million records total
• Partition size: 128 MB
• On HDFS & S3
• Cluster:
– 8 nodes on Amazon m3.xlarge (4 CPUs, 15 GB mem, 40 GB SSD)
– Hadoop cluster: Hortonworks HDP v2.2
– Spark: on the same 8 nodes, standalone, v1.2
(c) ElephantScale.com 2015 50
51. HDFS Vs. S3 (lower is better)
[chart]
(c) ElephantScale.com 2015 51
52. HDFS Vs. S3 (lower is better)
[chart]
(c) ElephantScale.com 2015 52
53. HDFS Vs. S3 Conclusions
• HDFS: data locality → much higher throughput. S3: data is streamed → lower throughput.
• HDFS: need to maintain a Hadoop cluster. S3: no Hadoop cluster to maintain → convenient.
• HDFS: large data sets (TB+). S3 good use case: smallish data sets (a few gigs), loaded once, cached, and re-used.
(c) ElephantScale.com 2015 53
54. Decision : File Systems
• Already have Hadoop? YES → use HDFS.
• NO → choose HDFS, S3, NFS (Ceph), or Cassandra (real time).
(c) ElephantScale.com 2015 54
55. Next Decision : SQL
"We use SQL heavily for data mining. We are using Hive / Impala on Hadoop. Is Spark right for us?"
(c) ElephantScale.com 2015 55
56. SQL in Hadoop / Spark
• Engine – Hadoop: Hive (on MapReduce, or Tez on Hortonworks), Impala (Cloudera). Spark: Spark SQL using DataFrames, Hive context.
• Language – Hadoop: HiveQL. Spark: HiveQL, plus RDD programming in Java / Python / Scala.
• Scale – Hadoop: terabytes / petabytes. Spark: gigabytes / terabytes / petabytes.
• Interoperability – Hadoop: data stored in HDFS. Spark: Hive tables, file system.
• Formats – Hadoop: CSV, JSON, Parquet. Spark: CSV, JSON, Parquet.
(c) ElephantScale.com 2015 56
57. DataFrames Vs. RDDs
• RDDs have data
• DataFrames also have a schema
• DataFrames used to be called 'SchemaRDD'
• Unified way to load / save data in multiple formats
• Provides high-level operations
– Count / sum / average
– Select columns & filter them
(c) ElephantScale.com 2015 57

// load JSON data into a DataFrame
df = sqlContext.read
  .format("json")
  .load("/data/data.json")

// save as Parquet (faster queries); note: save() takes a path,
// while saveAsTable() would take a table name instead
df.write
  .format("parquet")
  .save("/data/datap/")
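A sketch of those high-level operations using the Spark 1.x Scala DataFrame API (the file path and column names are made up for illustration):

import org.apache.spark.sql.functions.{avg, sum}

val df = sqlContext.read.format("json").load("/data/people.json")
df.select("name", "age").filter(df("age") > 35).show() // select columns & filter
df.groupBy("dept").count().show()                      // count per group
df.agg(avg("salary"), sum("salary")).show()            // average / sum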
60. Querying Using SQL
• A DataFrame can be registered as a temporary table
– You can then use SQL to query it, as shown below
– This is handled similarly to DSL queries: building up an AST and sending it to Catalyst

scala> df.registerTempTable("people")
scala> sqlContext.sql("select * from people").show()
name age
John 35
Jane 40
Mike 20
Sue 52

scala> sqlContext.sql("select * from people where age > 35").show()
name age
Jane 40
Sue 52

Session 6: Spark SQL (c) ElephantScale.com 2015 60
61. Going From Hive → Spark
• Spark natively supports querying data stored in Hive tables!
• Handy to use in an existing Hadoop cluster!!

HIVE
hive> select customer_id, SUM(cost) as total from billing group by customer_id order by total DESC LIMIT 10;

SPARK
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
val top10 = hiveCtx.sql(
  "select customer_id, SUM(cost) as total from billing group by customer_id order by total DESC LIMIT 10")
top10.collect()

Session 6: Spark SQL (c) ElephantScale.com 2015 61
62. Spark SQL Vs. Hive
[chart] Fast on the same HDFS data!
(c) ElephantScale.com 2015 62
63. Spark SQL Vs. Hive
[chart] Fast on the same data on HDFS
(c) ElephantScale.com 2015 63
64. Decision : SQL
• Using Hive? YES → Spark using HiveContext.
• NO → Spark SQL with DataFrames.
(c) ElephantScale.com 2015 64
65. Next Decision : ETL
"We do a lot of ETL work on our Hadoop cluster, using tools like Pig / Cascading. Can we use Spark?"
(c) ElephantScale.com 2015 65
66. ETL on Hadoop / Spark
• ETL tools – Hadoop: Pig, Cascading, Oozie. Spark: native RDD programming (Scala, Java, Python); Cascading?
• Pig – Hadoop: high-level ETL workflow. Spark: Spork (Pig on Spark).
• Cascading – Hadoop: high level. Spark: spark-scalding.
• Cask – works on both.
(c) ElephantScale.com 2015 66
67. Data Transformation on Spark
• DataFrames are great for high-level manipulation of data
– High-level operations: join / union, etc.
– Joining / merging disparate data sets
– Can read and understand a multitude of data formats (JSON / Parquet, etc.)
– Very easy to program
• RDD APIs allow low-level programming
– Complex manipulations
– Lookups
– Supports multiple languages (Java / Scala / Python)
• High-level libraries are emerging
– Tresata
– CASK
(c) ElephantScale.com 2015 67
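A sketch of the "joining disparate data sets" point with DataFrames (Spark 1.x Scala API; the paths, formats, and join columns are made up for illustration):

val billing = sqlContext.read.format("json").load("/data/billing.json")
val customers = sqlContext.read.format("parquet").load("/data/customers.parquet")

// join on a shared key, then keep a few columns
val joined = billing.join(customers, billing("customer_id") === customers("id"))
joined.select("customer_id", "name", "cost").show()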
68. Decisions : ETL
• Current ETL is Pig → Spork, the RDD API, or DataFrames.
• Current ETL is Cascading → Cascading on Spark, the RDD API, or DataFrames.
• Current ETL is Java MapReduce / custom → the RDD API or DataFrames.
(c) ElephantScale.com 2015 68
69. Decision : Machine Learning
"Can we use Spark for Machine Learning?" YES
(c) ElephantScale.com 2015 69
70. Machine Learning : Hadoop / Spark
• Tool – Hadoop: Mahout. Spark: MLlib.
• API – Hadoop: Java. Spark: Java / Scala / Python.
• Iterative algorithms – Hadoop: slower. Spark: very fast (in memory).
• In-memory processing – Hadoop: no. Spark: yes.
• Notes – Mahout runs on Hadoop or on Spark; MLlib is a new and young library.
• Latest news! Mahout only accepts new code that runs on Spark, so both Mahout & MLlib run on Spark.
• The future? Many opinions.
(c) ElephantScale.com 2015 70
71. Decision : In-Memory Processing
"How can we do in-memory processing using Spark?"
(c) ElephantScale.com 2015 71
72. Numbers Everyone Should Know – by Jeff Dean, Fellow @ Google
Operation: cost (in nanoseconds)
• L1 cache reference: 0.5
• Branch mispredict (CPU): 5
• L2 cache reference: 7
• Mutex lock/unlock: 100
• Main memory reference: 100
• Compress 1K bytes with Zippy: 10,000
• Send 2K bytes over 1 Gbps network: 20,000
• Read 1 MB sequentially from memory: 250,000 (0.25 ms)
• Round trip within same datacenter: 500,000 (0.5 ms)
• Disk seek: 10,000,000 (10 ms)
• Read 1 MB sequentially from network: 10,000,000 (10 ms)
• Read 1 MB sequentially from disk: 30,000,000 (30 ms)
• Send packet CA->Netherlands->CA: 150,000,000 (150 ms)
(c) ElephantScale.com 2015 72
73. Spark Caching
• Caching is pretty effective (small / medium data sets)
• Cached data cannot be shared across applications (each application executes in its own sandbox)
(c) ElephantScale.com 2015 73
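A minimal caching sketch in the Scala shell; the second action reuses the in-memory copy instead of re-reading from storage (the path is a placeholder):

val events = sc.textFile("hdfs:///data/events").cache() // mark for in-memory caching
events.count()                             // first action reads from storage, fills the cache
events.filter(_.contains("ERROR")).count() // later actions hit the cache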
76. Sharing Cached Data
• By default, Spark applications cannot share cached data (each runs in isolation)
• Option 1: 'Spark Job Server'
– A multiplexer: all requests are executed through the same 'context'
– Provides a web-service interface
• Option 2: Tachyon
– Distributed in-memory file system
– Memory is the new disk!
– Out of the AMP Lab, Berkeley
– Early stages (very promising)
(c) ElephantScale.com 2015 76
78. Spark Job Server
• Open sourced from Ooyala
• 'Spark as a Service': a simple REST interface to launch jobs
• Sub-second latency!
• Pre-load jars for even faster spin-up
• Share cached RDDs across requests (NamedRDD)
• https://github.com/spark-jobserver/spark-jobserver
(c) ElephantScale.com 2015 78

// Sharing a cached RDD across requests (a sketch; the calls follow
// spark-jobserver's NamedRddSupport trait, and exact signatures may
// vary across jobserver versions; get returns an Option)
App1: this.namedRdds.update("my cached rdd", rdd1)
App2: val rdd2 = this.namedRdds.get[String]("my cached rdd").get
80. How to Get Spark?
Session 2: Introduction to Spark
81. Getting Spark
• Running Hadoop? YES → install Spark on the Hadoop cluster.
• NO → Need HDFS?
– YES → install HDFS + YARN + Spark.
– NO → which environment?
• Production → Spark + Mesos + S3
• Testing → Spark (standalone) + NFS or S3
(c) ElephantScale.com 2015 81
82. Spark Cluster Setup 1 : Simple
• Great for POCs / experimentation
• No dependencies
• Uses Spark's 'stand-alone' manager
(c) ElephantScale.com 2015 82
83. Spark Cluster Setup 2 : Production
• Works well with the Hadoop ecosystem (HDFS / Hive, etc.)
• Best way to adopt Spark on Hadoop
• Uses YARN as the cluster manager
(c) ElephantScale.com 2015 83
84. Spark Cluster Setup 3 : Production
• Uses Mesos as the cluster manager
(c) ElephantScale.com 2015 84
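The three setups differ mainly in the --master URL passed to spark-submit (a sketch; host names and the application jar are placeholders):

# Setup 1: Spark's stand-alone manager
spark-submit --master spark://master-host:7077 --class com.example.MyApp myapp.jar

# Setup 2: YARN on an existing Hadoop cluster (Spark 1.x syntax)
spark-submit --master yarn-cluster --class com.example.MyApp myapp.jar

# Setup 3: Mesos
spark-submit --master mesos://mesos-host:5050 --class com.example.MyApp myapp.jar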
86. Use Case 1 : Moving to Cloud
(c) ElephantScale.com 2015 86
87. Use Case 1 : Lessons Learned
• Size
– Small Hadoop cluster (8 nodes)
– Smallish data: 50 GB – 300 GB
– Data for processing: a few gigs per query
• Good!
– Only one moving part: Spark
– No Hadoop cluster to maintain
– S3 was dependable storage (passive)
– Query response time went from minutes to seconds (because we went from MR to Spark)
• Not so good
– We lost the data locality of HDFS (OK for small / medium data sets)
(c) ElephantScale.com 2015 87
88. Use Case 2 : Persistent Caching in Spark
• Can we improve latency in this setup?
• Caching will help
• However, in Spark cached data cannot be shared across applications
(c) ElephantScale.com 2015 88
89. Use Case 2 : Persistent Caching in Spark
• Spark Job Server to the rescue!
(c) ElephantScale.com 2015 89
90. Final Thoughts
• Already on Hadoop?
– Try Spark side by side
– Process some data in HDFS
– Try Spark SQL for Hive tables
• Contemplating Hadoop?
– Try Spark (standalone)
– Choose an NFS or S3 file system
• Take advantage of caching
– Iterative loads
– Spark Job Server
– Tachyon
(c) ElephantScale.com 2015 90
91. Thanks and questions?
Sujee Maniyam
Founder / Principal @ ElephantScale
Expert Consulting + Training in Big Data technologies
sujee@elephantscale.com
Elephantscale.com
Sign up for upcoming trainings: ElephantScale.com/training
(c) ElephantScale.com 2015 91