"Apache Spark™ is a fast and general engine for large-scale data processing." The statement above is taken from the Apache Spark welcome page. It's one of those definitions that, while describing the product in one sentence and being 100% true, still tells a curious newcomer very little.
Why take an interest in Apache Spark? Apache Spark promises to be up to 100x faster than Hadoop MapReduce in certain scenarios. It provides a comprehensible programming model (familiar to anyone used to functional programming) and a vast ecosystem of tools.
In my talk I will try to reveal the secrets of Apache Spark to absolute beginners.
We will first give a quick introduction to the set of problems commonly known as Big Data: what they try to solve, what their obstacles and challenges are, and how those can be addressed. We will quickly take a peek at MapReduce: theory and implementation. We will then move on to Apache Spark. We will see what the main factor was that drove its creators to introduce yet another large-scale processing engine, how it works, and what its main advantages are. The presentation will be a mix of slides and code examples.
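The MapReduce model the talk builds on can be sketched in a few lines of plain Python (a toy, single-machine illustration of the idea, not Hadoop's or Spark's actual API): a map phase emits key-value pairs, and a reduce phase groups them by key and combines the values.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["to be or not to be"]
print(reduce_phase(map_phase(lines)))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In a real cluster the map and reduce phases run in parallel across machines; the toy above keeps only the shape of the computation.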
Does Google still need links? - SearchLove San Diego 2017 | Tom Capper
Back in Google's early days, people navigated the web using links, and this made PageRank an excellent proxy for popularity and authority. The web is moving away from primarily link based surfing, and Google no longer needs a proxy - so what, in 2017, is the point in links?
The Sourcecon webinar slides delivered by Andy Headworth from http://sironaconsulting.com/ on 22nd October 2014. It is about using Twitter and Google Plus to source candidates.
It covers sourcing individuals on both Google+ and Twitter as well as sourcing candidates from Communities and Twitter Lists.
Using Visualizations to Monitor Changes and Harvest Insights from a Global-sc... | Krist Wongsuphasawat
Slides from my talk at the IEEE Conference on Visual Analytics Science and Technology (VAST) 2014 in Paris, France.
ABSTRACT
Logging user activities is essential to data analysis for internet products and services.
Twitter has built a unified logging infrastructure that captures user activities across all clients it owns, making it one of the largest datasets in the organization.
This paper describes challenges and opportunities in applying information visualization to log analysis at this massive scale, and shows how various visualization techniques can be adapted to help data scientists extract insights.
In particular, we focus on two scenarios: (1) monitoring and exploring a large collection of log events, and (2) performing visual funnel analysis on log data with tens of thousands of event types.
Two interactive visualizations were developed for these purposes:
we discuss design choices and the implementation of these systems, along with case studies of how they are being used in day-to-day operations at Twitter.
New information social networking - Nice 2011 | John Mayfield
This was a presentation for FNAIM in Nice, France, to explain why Social Media is important to consider as a real estate agent, and ways to implement social media in their businesses.
Enterprise SEO - Pain Management Strategies | petryshen
Achieving success in SEO can be challenging at the best of times. At the enterprise level, multiple stakeholders, legacy technology and unknown hurdles can make the road to SEO success long and difficult. Once you get there, the last thing you want to do is to fall off the radar. Tom shares his SEO pain management strategies to stay on top.
If you've avoided creating social media profiles because you were hoping it was a passing fad, or if you created them but have no idea what to do now, this presentation is for you. You'll learn the essentials you need to optimize social media for business and personal use and control the digital fingerprints you leave behind.
Learner objectives:
- Understand the difference between different social media platforms.
- Learn what content is best to post and share on each platform.
- Identify best practices for professional and personal use.
For more business-friendly advice, especially for admins and event planners, visit http://planyourmeetings.com. Like what you see? Subscriptions are free!
A complete guide to the best times to post on social media (and more!) | Marketing Wallah
Do you know the most effective times to post on social media, send an email, or publish a blog? We've broken down the data behind the most effective times to post content on Twitter, Instagram, Facebook, Content Marketing, and Email.
Congrats! You're being social. Now what? Managing multiple profiles can be overwhelming and more than a little intimidating. If you're not seeing any return on the time you're spending on social platforms or if you're not sure of next steps, this session is for you. You'll learn about tools that will help you minimize the amount of time you're spending on social media, while maximizing the size and engagement level of your audience.
This presentation was originally created for the 2015 IAAP Georgia-Alabama Branch Event.
For more business advice, best practices and time-management tips, visit http://planyourmeetings.com. Like what you see? Subscriptions are free.
SearchLove San Diego 2018 | Ashley Ward | Reuse, Recycle: How to Repurpose Yo... | Distilled
Creating content that can be reused is an effective way to extend the life of your content, increase its views, and reach your content marketing goals. Ashley will be demonstrating how to find the content which should be reused, the rules to follow when reusing your content, and how to analyze the effectiveness of this recycled content and which tools will best help us find ROI.
You can expect to walk out of this content session with a strategic plan to take back to your office on how to recycle your content at low cost and achieve high ROI.
We're told on a regular basis to monitor the performance, speed, responsiveness, memory, and general health of our websites, with the ever-present threat of down time hanging over our shoulder. But how often do we pay this same attention to our own physical and mental health?
As a Type 1 Diabetic, it's a little more front-of-mind, as it's not just about how much exercise I've gotten in the last month, how healthy my diet is, or how much of a workaholic I am... It's about what the ratio of sugar to insulin is in my bloodstream at every moment of every day. It's about making sure I've got a spare insulin pod, my test machine, a granola bar, glucose tabs, and my trusty sidekick Ember Dog (with all of his accouterment) at all times.
But just because I have to be more aware of certain things doesn't lessen the importance of paying attention to general physical and mental health, which come with their own set of potentially deadly side effects. In this Ignite talk, I'll touch briefly on my day-to-day life with diabetes, and then segue into what the past two years have taught me about mental and physical health.
A visualization tool used to see whether ai-one's biologically inspired computing can discern meaningful associations in the mess of tweets from a technical conference. This capability serves as the foundation for building intelligent agents and other applications that allow human interpretation of large data sets.
Voice search is fast becoming part of our broader organic search ecosystem. Discover how organic data, including Featured Snippets, feed voice and represent new SEO opportunities.
We start by very briefly introducing the Twitter platform and detailing the demographics of the users and the biases they introduce. The relationship between geography, mobility and social network properties will be described using the Twitter service as a case study. Finally, tutorial attendees will get the chance to review the most seminal works in the area where spatial and geographic perspectives are highlighted.
This slide deck is used as an introduction to the internals of Apache Spark, as part of the Distributed Systems and Cloud Computing course I hold at Eurecom.
Course website:
http://michiard.github.io/DISC-CLOUD-COURSE/
Sources available here:
https://github.com/michiard/DISC-CLOUD-COURSE
Apache Spark is a In Memory Data Processing Solution that can work with existing data source like HDFS and can make use of your existing computation infrastructure like YARN/Mesos etc. This talk will cover a basic introduction of Apache Spark with its various components like MLib, Shark, GrpahX and with few examples.
This presentation summarizes multiple screen development difficulties, optimizations for different kinds of devices and screen sizes and gives best practices to handle multi screen problems in Android.
Redington Value is the Value Added Distribution division of Redington Gulf, the largest distributor of IT products in the Middle East and Africa. Redington Value helps its partners in the channel deliver the most optimal IT solution to their customers in the Middle East and Africa. These solutions span technology domains such as Networking, Voice, Servers, Storage, Software, Security and Infrastructure.
Sukrit magazine offers complete guidance on human science and life: identity, spirituality, stressful job conditions, health, hygiene and much more.
It offers remedies for the stresses of present-day human life.
The key to spiritual ascension and good living.
Explore eternal knowledge, self-identity, the objective of human life, science, health and professional guidance through the latest issue of Sukrit. This magazine is a unique tool for a fulfilled human life.
Automotive Troubleshooting With An Oscilloscope | Jeffrey Bledsoe
A local auto repair shop made several bad guesses at a church van's intermittent ignition issue, with costs totaling about $1,600. Volunteering to determine the cause of the problem, I instrumented engine ignition signals, set the oscilloscope to trigger on engine shutdown, and drove the van around town for about 3 months until the failure mechanism was revealed.
Novel machine learning techniques come from spending time with people who have distinct needs. This talk addresses how listening to end users can give rise to novel machine learning applications.
Faster! Faster! Accelerate your business with blazing prototypes | OSCON Byrum
Bring your ideas to life! Convince your boss that open source development is faster and cheaper than the "safe" COTS solution they probably hate anyway. Let's investigate ways to get real-life, functional prototypes up with blazing speed. We'll look at and compare tools for truly rapid development including Python, Django, Flask, PHP, Amazon EC2 and Heroku.
DN18 | A/B Testing: Lessons Learned | Dan McKinley | Mailchimp | Dataconomy Media
Abstract about the Presentation:
Introducing A/B testing to a large team that has never done it before is a weird and bewildering thing that Dan McKinley has somehow done twice. This has burdened him with many opinions about how to achieve this with minimal wailing and gnashing of teeth.
About the Author:
Dan McKinley is a Co-Founder of Skyliner in Los Angeles. Previously he worked at Stripe and spent nearly 7 years building Etsy, during which he worked on “pretty much every feature and backend facility on the site”. He resides in LA with his wife and son.
LESSON 3B. FOCUS: FOR LOOPS, NESTED LOOPS, TASKS AND CHALLENGES.
Introduction to, with examples, For loops. Challenges and tasks included with solutions (predict the output). Compare ‘while’ and ‘for’ loops. Use the break statement and explore how it works in different scenarios. Learn about Nested Loops. Learn about the need for initialisation (set starting value). Create your own for loops. Create the beginnings of an arithmetic quiz using a random function and for loops. Big ideas discussion: Is the universe digital. A program? Introducing Gottfried Leibniz and Konrad Zuse. Includes a suggested videos, ‘Big ideas’ discussion, and HW/research projects section.
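The loop constructs the lesson covers can be sketched in Python like this (my own minimal illustration; the quiz snippet auto-answers itself so the example runs unattended, whereas in class you would read the answer with input()):

```python
import random

# A 'for' loop with break: stop at the first multiple of 7
for n in range(1, 20):
    if n % 7 == 0:
        print("first multiple of 7:", n)  # prints 7
        break

# Nested loops: a small multiplication table
for row in range(1, 4):
    for col in range(1, 4):
        print(row * col, end=" ")
    print()

# Beginnings of an arithmetic quiz using a random function and a for loop
score = 0  # initialisation: set the starting value
for _ in range(3):
    a, b = random.randint(1, 10), random.randint(1, 10)
    answer = a + b  # in the real quiz, read this with: int(input(f"What is {a} + {b}? "))
    if answer == a + b:
        score += 1
print("You scored", score, "out of 3")
```

Predicting the output of snippets like these is exactly the kind of task the lesson's challenges pose.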
From list sorting to network routing, and from hash tables to capacity planning, a programmer's daily work is filled with probability. We use probabilistic algorithms, data structures, and systems constantly often without even thinking about it. Experienced engineers reach for probabilistic algorithms frequently and intentionally, especially when building systems of serious scale. How do probabilistic algorithms actually work in practice? And how do we know they'll be safe and reliable in our critical production systems? We'll address those questions, explore a few algorithms, and see why "with high probability" is often better than "exactly".
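As a concrete illustration (my own minimal sketch, not code from the talk): a Bloom filter answers membership queries "with high probability". It may report false positives but never false negatives, and it uses far less memory than an exact set.

```python
import hashlib

class BloomFilter:
    """A tiny Bloom filter: set membership with possible false
    positives, guaranteed no false negatives."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item):
        # Derive several bit positions from independent-ish hashes
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True means probably present
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("spark")
print(bf.might_contain("spark"))   # True
print(bf.might_contain("hadoop"))  # almost certainly False
```

With 1024 bits, 3 hash functions, and one item inserted, the false-positive probability is roughly (3/1024)³, which is why "probably present" is good enough in practice.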
Replication in Data Science - A Dance Between Data Science & Machine Learning... | June Andrews
We use Iterative Supervised Clustering as a simple building block for exploring Pinterest's content. But simplicity can unlock great power, and with this building block we show the shocking result of how hard it is to replicate data science conclusions. This leads us to ask: when is data science a house of cards?
Computing Social Score of Web Artifacts | Venkatesh J N
We propose an approach which computes a single aggregate score of an artifact that reflects the popularity across different social media sites and not just limited to any particular site.
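A minimal sketch of such an aggregation (the site names and weights below are hypothetical illustrations, not the weighting proposed in the paper): normalise each site's popularity signal to a common scale, then take a weighted combination so that no single site dominates just because its raw counts are larger.

```python
def aggregate_score(signals, weights):
    """Combine per-site popularity signals (each already normalised
    to [0, 1]) into one weighted aggregate score."""
    total_weight = sum(weights[site] for site in signals)
    return sum(weights[site] * value for site, value in signals.items()) / total_weight

# Hypothetical normalised popularity signals for one artifact
signals = {"twitter": 0.8, "facebook": 0.5, "reddit": 0.2}
weights = {"twitter": 0.5, "facebook": 0.3, "reddit": 0.2}
print(round(aggregate_score(signals, weights), 2))  # 0.59
```

Dividing by the total weight keeps the score meaningful even when an artifact is absent from some sites.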
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da... | Databricks
PixieDust is a new open source library that helps data scientists and developers working in Jupyter Notebooks and Apache Spark be more efficient. PixieDust speeds up data manipulation and display with features like: auto-visualization of Spark DataFrames, real-time Spark job progress monitoring, automated local install of Python and Scala kernels running with Spark, and much more.
Come along and learn how you can use this tool in your own projects to visualize and explore data effortlessly with no coding. Oh, and if you prefer working with a Scala Notebook, this session is also for you, as PixieDust can also run on a Scala Kernel. Imagine being able to visualize your favorite Python chart engines from a Scala Notebook!
We’ll finish the session with a demo combining Twitter, Watson Tone Analyzer, Spark Streaming, and some fun real-time visualizations–all running within a Notebook.
Maintainable Software Architecture in Haskell (with Polysemy) | Pawel Szulc
Target audience:
Developers who are interested in seeing Haskell being used as a general programming language. Engineers who are hoping to see Haskell as an environment in which they can quickly and effectively iterate between requirements, design and running executable code: providing value to the business with an immediate feedback loop. Developers who are eager to see how they can rapidly create software that is flexible to changes, extensible and testable.
Topics explored:
software architecture
software testability and maintainability
Free Monads
Polysemy library
Pre-requirements:
- basic understanding of Haskell syntax (functions, composition, do-notation, type classes)
- basic FP building blocks: functor, monad
- (optionally) some previous exposure to free monads
- (optionally) scars and war stories of using tagless final in Haskell
Tech stack we will explore:
- polysemy
- servant
It's happened to all of us: we ran away from some conversation or library because it kept on using those "weird" phrases. You know, like "type classes", "semigroups", "monoids", "applicatives". Yikes! They all seem so academic, so pointlessly detached from real-world problems. But then again, given how frequently we run into them in functional programming, are they REALLY irrelevant, or do they have real-world applications? This talk will go beyond giving you raw definitions of these terms, and show you real-world motivations behind the concepts. By attending, you'll be able to keep your skills relevant to an ever-changing industry, confuse your significant other ("You know, honey, a monad is just a monoid in the category of endofunctors!"), and sound extra smart on the next job interview!
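To make one of those "weird" words concrete: a monoid is just an associative operation plus an identity element, and that is exactly what makes folds safe on empty input and splittable across chunks or machines. A small sketch (my own illustration, not code from the talk):

```python
from functools import reduce

# A monoid is a set with an associative operation and an identity element.
def mconcat(op, identity, values):
    # Fold the values with op, starting from the identity,
    # so empty input is handled for free.
    return reduce(op, values, identity)

assert mconcat(lambda a, b: a + b, 0, [1, 2, 3]) == 6         # ints under +
assert mconcat(lambda a, b: a + b, "", ["fo", "o"]) == "foo"  # strings under concat
assert mconcat(max, float("-inf"), [3, 1, 2]) == 3            # max with -inf identity

# Associativity lets you fold chunks independently, then combine the
# partial results: the basis of map-reduce style parallelism.
chunks = [[1, 2], [3], []]
partial = [mconcat(lambda a, b: a + b, 0, c) for c in chunks]
total = mconcat(lambda a, b: a + b, 0, partial)
print(total)  # 6
```

The same structure is what type-class libraries capture once and reuse everywhere.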
Developing Distributed High-performance Computing Capabilities of an Open Sci... | Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus... | Globus
Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.
Quarkus Hidden and Forbidden Extensions | Max Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
Understanding Globus Data Transfers with NetSage | Globus
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks worldwide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including:
- Who is using Globus to share data with my institution, and what kind of performance are they able to achieve?
- How many transfers has Globus supported for us?
- Which sites are we sharing the most data with, and how is that changing over time?
- How is my site using Globus to move data internally, and what kind of performance do we see for those transfers?
- What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
Globus Compute with IRI Workflows - GlobusWorld 2024 | Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work, the team is investigating ways to speed up the time to solution for many different parts of the DIII-D workflow, including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks, and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
SOCRadar Research Team: Latest Activities of IntelBroker | SOCRadar
The European Union Agency for Law Enforcement Cooperation (Europol) has suffered an alleged data breach after a notorious threat actor claimed to have exfiltrated data from its systems. Infamous data leaker IntelBroker posted on the even more infamous BreachForums hacking forum, saying that Europol suffered a data breach this month.
The alleged breach affected Europol agencies CCSE, EC3, Europol Platform for Experts, Law Enforcement Forum, and SIRIUS. Infiltration of these entities can disrupt ongoing investigations and compromise sensitive intelligence shared among international law enforcement agencies.
However, this is neither the first nor the last activity of IntelBroker. We have compiled for you what happened in the last few days. To track such hacker activities on dark web sources like hacker forums, private Telegram channels, and other hidden platforms where cyber threats often originate, you can check SOCRadar's Dark Web News.
Stay Informed on Threat Actors’ Activity on the Dark Web with SOCRadar!
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G... | Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ... | Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc. I didn't get rich from it, but they did reach 63K downloads (powering possibly tens of thousands of websites).
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite | Google
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
https://sumonreview.com/ai-pilot-review/
AI Pilot Review: Key Features
✅Deploy AI expert bots in Any Niche With Just A Click
✅With one keyword, generate complete funnels, websites, landing pages, and more.
✅More than 85 AI features are included in the AI pilot.
✅No setup or configuration; use your voice (like Siri) to do whatever you want.
✅You Can Use AI Pilot To Create your version of AI Pilot And Charge People For It…
✅ZERO Manual Work With AI Pilot. Never write, Design, Or Code Again.
✅ZERO Limits On Features Or Usages
✅Use Our AI-powered Traffic To Get Hundreds Of Customers
✅No Complicated Setup: Get Up And Running In 2 Minutes
✅99.99% Up-Time Guaranteed
✅30 Days Money-Back Guarantee
✅ZERO Upfront Cost
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
An Enterprise Resource Planning system includes various modules that reduce any business's workload. Additionally, it organizes workflows, which enhances productivity. Here is a detailed explanation of the ERP modules. Going through the points will help you understand how the software is changing work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
May Marketo Masterclass, London MUG May 22 2024.pdfAdele Miller
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
Cyaniclab : Software Development Agency Portfolio.pdfCyanic lab
CyanicLab, an offshore custom software development company based in Sweden,India, Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamtakuyayamamoto1800
In this slide, we show the simulation example and the way to compile this solver.
In this solver, the Helmholtz equation can be solved by helmholtzFoam. Also, the Helmholtz equation with uniformly dispersed bubbles can be simulated by helmholtzBubbleFoam.
Into the Box Keynote Day 2: Unveiling amazing updates and announcements for modern CFML developers! Get ready for exciting releases and updates on Ortus tools and products. Stay tuned for cutting-edge innovations designed to boost your productivity.
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfJay Das
With the advent of artificial intelligence or AI tools, project management processes are undergoing a transformative shift. By using tools like ChatGPT, and Bard organizations can empower their leaders and managers to plan, execute, and monitor projects more effectively.
How to Position Your Globus Data Portal for Success Ten Good PracticesGlobus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
14. twitter: @rabbitonweb,
email: paul.szulc@gmail.com
Big Data is like...
“Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.”
21. Big Data is all about...
● well, the data :)
● It is said that 2.5 exabytes (2.5×10^18 bytes) of data are created around the world every single day
● It is a scale at which you can no longer use standard tools and methods of processing
25. To the rescue: MapReduce
“MapReduce is a framework for processing parallelizable problems across huge datasets using a cluster, taking into consideration scalability and fault-tolerance.”
32. MapReduce - key/value
“In MapReduce, no value stands on its own. Every value has a key associated with it. Keys identify related values. The mapping and reducing functions receive not just values, but (key, value) pairs. The output of each of these functions is the same: both a key and a value.”
33. Word Count
● The “Hello World” of the Big Data world.
● For an initial input of multiple lines, extract all words together with their number of occurrences.
Input:
To be or not to be
Let it be
Be me
It must be
Let it be
Expected output:
be 6
to 2
it 3
let 2
or 1
not 1
must 1
me 1
36–39. Input → Splitting → Mapping → Shuffling → Reducing → Final result
Input:
To be or not to be
Let it be
Be me
It must be
Let it be
Splitting: every line becomes an independent input record that can be processed in parallel.
Mapping: each word is emitted as a (word, 1) pair:
to 1, be 1, or 1, not 1, to 1, be 1 | let 1, it 1, be 1 | be 1, me 1 | it 1, must 1, be 1 | let 1, it 1, be 1
Shuffling: pairs are grouped by key, so all occurrences of a word land at the same reducer:
be 1 ×6, to 1 ×2, it 1 ×3, let 1 ×2, or 1, not 1, must 1, me 1
Reducing: each reducer sums the partial counts for its key.
Final result:
be 6
to 2
it 3
let 2
or 1
not 1
must 1
me 1
41. Word count - pseudo-code
function map(String name, String document):
  for each word w in document:
    emit (w, 1)

function reduce(String word, Iterator partialCounts):
  sum = 0
  for each pc in partialCounts:
    sum += ParseInt(pc)
  emit (word, sum)
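The pseudo-code above can be sketched as a tiny, self-contained simulation in plain Python (no Hadoop involved; `map_fn`, `shuffle` and `reduce_fn` are names invented for this illustration): the mapper emits (word, 1) pairs, a shuffle groups them by key, and the reducer sums each group.

```python
from collections import defaultdict

def map_fn(name, document):
    # emit a (word, 1) pair for every word in the document
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # group all values by key, like the framework's shuffle phase
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(word, partial_counts):
    # sum the partial counts for one word
    return (word, sum(partial_counts))

lines = ["To be or not to be", "Let it be", "Be me", "It must be", "Let it be"]
pairs = [p for line in lines for p in map_fn("doc", line)]
result = dict(reduce_fn(w, counts) for w, counts in shuffle(pairs).items())
print(result)
# {'to': 2, 'be': 6, 'or': 1, 'not': 1, 'let': 2, 'it': 3, 'me': 1, 'must': 1}
```

In a real cluster each phase runs on many machines and the shuffle moves data over the network; the logic per phase, however, is exactly this simple.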
46. Word count: Hadoop implementation
public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) { sum += val.get(); }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    // (truncated on the slide; the remaining lines set the output format,
    // the input/output paths and call job.waitForCompletion)
  }
}
60. Performance issues
● One rigid map–reduce pair per job; anything more complex requires chaining many jobs
● Output of every job is saved to the file system
● Iterative algorithms go through the IO path again and again
● Poor API: only (key, value); even a basic join requires expensive code
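To see why iterative algorithms hurt, here is a toy illustration in plain Python (function names are invented; a temp file stands in for HDFS): one version writes and re-reads its intermediate result on every pass, MapReduce-style, while the other keeps the working set in memory. Both compute the same answer; only the first pays IO on every iteration.

```python
import json
import os
import tempfile

def iterate_via_disk(data, step, iterations):
    # MapReduce-style: every iteration's output goes to storage and is read back
    for _ in range(iterations):
        data = [step(x) for x in data]
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)          # write the intermediate result out...
        with open(path) as f:
            data = json.load(f)         # ...and read it back for the next pass
        os.remove(path)
    return data

def iterate_in_memory(data, step, iterations):
    # in-memory engine style: the intermediate result stays in RAM between passes
    for _ in range(iterations):
        data = [step(x) for x in data]
    return data

result_disk = iterate_via_disk([1, 2, 3], lambda x: x * 2, 10)
result_ram = iterate_in_memory([1, 2, 3], lambda x: x * 2, 10)
print(result_disk == result_ram)  # True: same answer, very different IO cost
```

On a cluster the "file" is a replicated distributed file system, so each round trip also includes network transfer and replication, which is exactly the cost Spark avoids for iterative workloads.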
62. Problems with MapReduce
1. MapReduce provides a difficult programming model for developers
2. It suffers from a number of performance issues
3. While batch-mode analysis is still important, reacting to events as they arrive has become more important (MapReduce lacks support for near real-time processing)
101–105. Spark performance - vs Hadoop (3)
“(...) we decided to participate in the Sort Benchmark (...), an industry benchmark on how fast a system can sort 100 TB of data (1 trillion records). The previous world record was 72 minutes, set by (...) Hadoop (...) cluster of 2100 nodes. Using Spark on 206 nodes, we completed the benchmark in 23 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All (...) without using Spark’s in-memory cache.”
117–147. The Big Picture
Driver Program:
val master = "spark://host:pt"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val logs = sc.textFile("logs.txt")
println(logs.count())

(The original deck animates this across slides 117–147.) The Driver Program talks to the Master of the cluster (Standalone, Yarn or Mesos), which allocates Executor 1, Executor 2 and Executor 3 on the worker nodes. sc.textFile("logs.txt") reads from distributed storage (HDFS, GlusterFS), taking data locality into account. When the action logs.count() runs, the driver breaks the job into tasks T1, T2 and T3 and ships them to the executors, which process their partitions in parallel and send their partial results back to the driver. When Executor 3 dies mid-job, its task is simply re-scheduled on the surviving executors and the job still completes.
151–152. RDD - the definition
RDD stands for Resilient Distributed Dataset
Resilient - if data is lost, it can be recreated
Distributed - stored on nodes across the cluster
Dataset - initial data comes from a file or can be created programmatically
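“Resilient” works through lineage: an RDD remembers how it was derived from its parent, so a lost partition can be recomputed instead of being restored from a replica. A minimal sketch of the idea in Python (the class and its methods are invented for illustration, not Spark's API):

```python
class TinyRDD:
    """Toy stand-in for an RDD: cached data plus the recipe that produced it."""

    def __init__(self, compute, parent=None):
        self.compute = compute      # how to (re)build this dataset
        self.parent = parent        # lineage: where the data came from
        self.cached = None

    def collect(self):
        if self.cached is None:           # data lost, or never computed?
            self.cached = self.compute()  # recreate it from the lineage
        return self.cached

    def map(self, f):
        # derive a child RDD; it records only the transformation, not the data
        return TinyRDD(lambda: [f(x) for x in self.collect()], parent=self)

base = TinyRDD(lambda: [1, 2, 3])
doubled = base.map(lambda x: x * 2)
print(doubled.collect())   # [2, 4, 6]
doubled.cached = None      # simulate losing the partition
print(doubled.collect())   # [2, 4, 6] again, recomputed from lineage
```

Real Spark tracks lineage per partition, so after a node failure only the lost partitions are recomputed, not the whole dataset.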
164. RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
And yet another RDD…
Performance alert?!?
165. RDD - Operations
1. Transformations
a. map
b. filter
c. flatMap
d. sample
e. union
f. intersection
g. distinct
h. groupByKey
i. ….
2. Actions
a. reduce
b. collect
c. count
d. first
e. take(n)
f. takeSample
g. saveAsTextFile
h. ….
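This transformation/action split is what resolves the “performance alert” from slide 164: transformations only describe work, and an action executes it. A rough analogy in plain Python using generators (an illustration of the lazy-evaluation idea, not Spark code):

```python
def text_file(lines):
    # generator: nothing runs until the data is consumed
    for line in lines:
        yield line

logs = text_file(["ERROR disk full", "ok", "error timeout"])
lc_logs = (line.lower() for line in logs)               # transformation: lazy
errors = (line for line in lc_logs if "error" in line)  # transformation: lazy

# Nothing has been read or lowercased yet. The action forces evaluation:
number_of_errors = sum(1 for _ in errors)               # action, like count
print(number_of_errors)  # 2
```

Because the whole chain is known before anything runs, Spark can fuse the map and filter into a single pass over each partition instead of materializing intermediate datasets.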
167–169. RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
val numberOfErrors = errors.count
The call to count (an action) is what triggers the computation; numberOfErrors then holds the calculated value (a Long).
174–175. Why Spark Streaming
A need to process data in almost real time:
● monitoring
● web log analysis
● fraud detection
● online ads
Problem: no single framework to do both batch & stream processing.
177–178. How does Spark Streaming work?
Live streamed data is chopped by Spark Streaming into small RDDs (micro-batches); Spark Core then processes each of these small RDDs in turn and produces the output data.
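The micro-batch idea above can be sketched in a few lines of plain Python (function names are hypothetical): chop the incoming stream into small batches and hand each one to the same function you would use for ordinary batch processing.

```python
def micro_batches(stream, batch_size):
    # chop a live stream into small batches (the "small RDDs")
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch          # hand one micro-batch to the core engine
            batch = []
    if batch:
        yield batch              # flush whatever arrived last

def process_batch(batch):
    # the same batch-style code handles every micro-batch
    return len(batch)

stream = iter(["click", "view", "click", "buy", "view"])
counts = [process_batch(b) for b in micro_batches(stream, 2)]
print(counts)  # [2, 2, 1]
```

Spark Streaming cuts batches by time (e.g. every second) rather than by count, but the consequence is the same: one engine, and one API, serves both batch and streaming workloads.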
179–182. Spark Streaming - Usage
val ssc = new StreamingContext(conf, Seconds(1))
// similar to SparkContext: the entry point for the streaming API
val lines = ssc.socketTextStream("localhost", 9999)
// a DStream is created (think of it as a streamed RDD)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// exactly the same API as for RDDs
ssc.start()