Real-time analytics with Spark Streaming by Padma at Bangalore I & D meetup (https://www.meetup.com/Bengaluru-Insights-and-Data-Meetup/events/238459154)
What is Big Data? What is Data Science? What are the benefits? How will they evolve in my organisation?
Built around the premise that the investment in big data is far less than the cost of not having it, this presentation, made at a tech media industry event, unveils and explores the nuances of Big Data and Data Science and their synergy forming Big Data Science. It highlights the benefits of investing in them and defines a path to their evolution within most organisations.
This presentation was prepared by one of our renowned tutors, Suraj.
If you are interested in learning more about Big Data, Hadoop, or Data Science, join our free introduction class on 14 Jan at 11 AM GMT. To register your interest, email us at info@uplatz.com
Big Data PPT PowerPoint Presentation Slides – SlideTeam
Big data has brought about a revolution in the field of information technology. Our content-ready big data PPT PowerPoint presentation slides shed light on the importance and relevance of large volumes of data. The data management presentation covers a myriad of topics such as big data sources, market forecast, the 3 Vs, technologies, workflow, the data analytics process, impact, benefits, future, opportunities and challenges, plus many additional slides containing graphs and charts. The biggest benefit this big data analytics presentation template offers is that it enables you to unearth information that can be used to shape the future of your business. These designs can also be used to craft your own presentation on predictive analytics, data processing applications, databases, cloud computing, business intelligence, and user behavior analytics. Download the big data PPT visuals to help you make accurate business decisions. Enlighten folks on fraud with our Big Data PPT PowerPoint Presentation Slides, and convince them to be highly alert.
The talk presents the evolution of Big-Data systems from single-purpose MapReduce frameworks to fully general computational infrastructures. In particular, I will follow the evolution of Hadoop, and show the benefits and challenges of a new architectural paradigm that decouples the resource management component (YARN) from the specifics of the application frameworks (e.g., MapReduce, Tez, REEF, Giraph, Naiad, Dryad, Spark, ...). We argue that besides the primary goals of increasing scalability and programming model flexibility, this transformation dramatically facilitates innovation.
In this context, I will present some of our contributions to the evolution of Hadoop (namely, work-preserving preemption and predictable resource allocation), and comment on the fascinating experience of working on open-source technologies from within Microsoft. The current Hadoop APIs (HDFS and YARN) provide the cluster equivalent of an OS API. With this as a backdrop, I will present our attempt to create the equivalent of stdlib for the cluster: the REEF project.
Carlo A. Curino received a PhD from Politecnico di Milano, and spent two years as a Postdoctoral Associate at MIT CSAIL leading the Relational Cloud project. He worked at Yahoo! Research as a Research Scientist focusing on mobile/cloud platforms and entity deduplication at scale. Carlo is currently a Senior Scientist at Microsoft in the Cloud and Information Services Lab (CISL), where he works on big-data platforms and cloud computing.
Top Big Data Analytics Tools: Emerging Trends and Best Practices – SpringPeople
For many IT experts, big data analytics tools and technologies are now a top priority. This slide deck walks through the top big data analytics tools you can use to start and advance the process of big data analysis.
Data Wrangling and the Art of Big Data Discovery – Inside Analysis
The Briefing Room with Dr. Robin Bloor, Trifacta and Zoomdata
Live Webcast March 10, 2015
Watch the Archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=dd9fed3c7c476ae3a0f881ae6b53dcc5
Square pegs and round holes don't get along, which is one reason why traditional data management approaches simply won't work for Big Data. The variety and velocity of data types flying at us today require a new strategy for identifying, streamlining and utilizing information assets and processes. Decades-old technology won’t cut it – a combination of new tools and techniques must be used to enable effective discovery of insights in a timely fashion.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor explain why today's data landscape calls for a much different data management approach. He'll be briefed by Trifacta and Zoomdata, who will show how their technologies use a range of functionality – including machine learning – to help companies "wrangle" their data. They'll also demonstrate the optimal step-by-step process of working with new data types.
Visit InsideAnalysis.com for more information.
The (very) basics of AI for the Radiology resident – Pedro Staziaki
The (very) basics of AI for the Radiology resident.
Also on YouTube: https://youtu.be/ia90UKjlmBA
Artificial Intelligence, Machine Learning, Deep Learning, CNN, Convolutional Neural Networks, Support Vector Machine (SVM), GPU. Felipe Kitamura. Pedro Vinícius Staziaki.
In this talk, we introduce the Data Scientist role, differentiate investigative from operational analytics, and demonstrate a complete Data Science process using Python ecosystem tools such as IPython Notebook, Pandas, Matplotlib, NumPy, SciPy and Scikit-learn. We also touch on the use of Python in a Big Data context, with Hadoop and Spark.
The role of data engineering in data science and analytics practice – Joseph Benjamin Ilagan
The role of data engineering in data science and analytics practice. Presented in the Philippine Software Industry Association (PSIA) 40th Enablement Seminar.
Slides template by Slides Carnival (https://www.slidescarnival.com/)
Big Data Analysis Patterns - TriHUG 6/27/2013 – boorad
Big Data Analysis Patterns: Tying real world use cases to strategies for analysis using big data technologies and tools.
Big data is ushering in a new era for analytics with large scale data and relatively simple algorithms driving results rather than relying on complex models that use sample data. When you are ready to extract benefits from your data, how do you decide what approach, what algorithm, what tool to use? The answer is simpler than you think.
This session tackles big data analysis with a practical description of strategies for several classes of application types, identified concretely with use cases. Topics include new approaches to search and recommendation using scalable technologies such as Hadoop, Mahout, Storm, Solr, & Titan.
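As a toy illustration of the recommendation strategies the session describes (data and names invented for the example), an item-based recommender can be built from co-occurrence counts alone – the same idea Mahout's item-similarity jobs apply at Hadoop scale:

```python
from collections import defaultdict
from itertools import combinations

def recommend(baskets, user_items, top_n=2):
    """Recommend items that co-occur most often with what the user already has."""
    cooc = defaultdict(int)  # (item_a, item_b) -> co-occurrence count
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1
    scores = defaultdict(int)
    for item in user_items:
        for (a, b), n in cooc.items():
            if a == item and b not in user_items:
                scores[b] += n
    return [item for item, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]

baskets = [["hadoop", "mahout"], ["hadoop", "mahout", "solr"], ["hadoop", "solr"]]
print(recommend(baskets, {"mahout"}))  # ['hadoop', 'solr']
```

At scale the pairwise counting becomes a distributed join, but the scoring logic stays this simple.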
Mastering MapReduce: MapReduce for Big Data Management and Analysis – Teradata Aster
Whether you’ve heard of Google’s MapReduce or not, its impact on Big Data applications, data warehousing, ETL, business intelligence, and data mining is re-shaping the market for business analytics and data processing.
Attend this session to hear from Curt Monash on the basics of the MapReduce framework, how it is used, and what implementations like SQL-MapReduce enable.
In this session you will learn:
* The basics of MapReduce, key use cases, and what SQL-MapReduce adds
* Which industries and applications are heavily using MapReduce
* Recommendations for integrating MapReduce in your own BI, Data Warehousing environment
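The MapReduce basics in the first bullet can be sketched in a few lines of plain Python – a toy, single-process rendition of the model, not a distributed implementation: the map phase emits key-value pairs, a shuffle groups them by key, and the reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(records):
    for line in records:               # map: emit (word, 1) per word
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):                    # shuffle: group values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):              # reduce: aggregate each group
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big analytics"])))
print(counts)  # {'big': 2, 'data': 1, 'analytics': 1}
```

SQL-MapReduce's contribution, per the session, is letting such map and reduce functions be invoked from within SQL queries.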
Advanced Analytics and Data Science Expertise – SoftServe
An overview of SoftServe's Data Science service line.
- Data Science Group
- Data Science Offerings for Business
- Machine Learning Overview
- AI & Deep Learning Case Studies
- Big Data & Analytics Case Studies
Visit our website to learn more: http://www.softserveinc.com/en-us/
Big Data Analysis Patterns with Hadoop, Mahout and Solr – boorad
Big Data Analysis Patterns: Tying real world use cases to strategies for analysis using big data technologies and tools.
Big data is ushering in a new era for analytics with large scale data and relatively simple algorithms driving results rather than relying on complex models that use sample data. When you are ready to extract benefits from your data, how do you decide what approach, what algorithm, what tool to use? The answer is simpler than you think.
This session tackles big data analysis with a practical description of strategies for several classes of application types, identified concretely with use cases. Topics include new approaches to search and recommendation using scalable technologies such as Hadoop, Mahout, Storm, Solr, & Titan.
Introduction to Data Science (Data Summit, 2017) – Caserta
At DBTA's 2017 Data Summit in New York, NY, Caserta Founder & President, Joe Caserta, and Senior Architect, Bill Walrond, gave a pre-conference workshop presenting the ins and outs of data science. Data scientist has been dubbed the "sexiest" job of the 21st century, but it requires an understanding of many different elements of data analysis. This presentation dives into the fundamentals of data exploration, mining, and preparation, applying the principles of statistical modeling and data visualization in real-world applications.
Introduction to streaming data: the difference between batch processing and stream processing, research issues in streaming data processing, performance evaluation metrics, and tools for stream processing.
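The batch-versus-stream distinction can be made concrete with a toy example (pure Python, invented data): a batch job sees the whole dataset at once, while a stream processor maintains an incrementally updated result per arriving element – both converge to the same answer.

```python
def batch_mean(values):
    return sum(values) / len(values)   # batch: whole dataset available up front

def stream_mean(stream):
    count, mean = 0, 0.0
    for x in stream:                   # stream: update state one element at a time
        count += 1
        mean += (x - mean) / count     # incremental running-mean update
    return mean

data = [4, 8, 15, 16, 23, 42]
print(batch_mean(data), stream_mean(iter(data)))  # 18.0 18.0
```

The research issues mentioned above arise precisely because the streaming version must bound its state and latency while the input is unbounded.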
• Explored and cleaned a huge amount of user activity logs (JSON) from a movies website using MapReduce jobs in Python.
• Classified user accounts into adults and children for targeted advertising by implementing a similarity-ranking algorithm.
• Grouped user sessions by user behavior using K-means clustering to observe outliers and find distinctive groups.
• Predicted movie ratings with user-user and item-item recommendation algorithms in Mahout.
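The session-clustering step above can be sketched with a minimal K-means in plain Python (toy session features invented for the example; a production job would use Mahout or scikit-learn):

```python
def kmeans(points, k, iters=10):
    # deterministic init for the sketch: spread starting centers across the data
    centers = [points[round(i * (len(points) - 1) / (k - 1))] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:   # assign each point to its nearest center
            best = min(range(k),
                       key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[best].append(p)
        for j, members in enumerate(clusters):   # move centers to cluster means
            if members:
                centers[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, clusters

# toy sessions: (pages viewed, minutes on site)
sessions = [(1, 2), (2, 1), (2, 2), (40, 50), (42, 48), (41, 52)]
centers, clusters = kmeans(sessions, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3] – two clearly separated groups
```

Points far from every converged center are the outlier candidates the bullet mentions.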
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ... – Spark Summit
In this presentation, we are going to talk about the state-of-the-art infrastructure we have established at Walmart Labs for the Search product using Spark Streaming and DataFrames. First, we have been able to successfully use multiple micro-batch Spark Streaming pipelines to update and process information like product availability, pick-up-today, etc., along with updating our product catalog information in our search index, at up to 10,000 Kafka events per second in near real-time. Earlier, all product catalog changes had a 24-hour delay before reaching the index; using Spark Streaming we have made it possible to see these changes in near real-time. This addition has provided a great boost to the business by giving end-customers instant access to features like availability of a product, store pick-up, etc.
Second, we have built a scalable anomaly detection framework purely using Spark DataFrames that our data pipelines use to detect abnormality in search data. Anomaly detection is an important problem not only in the search domain but also in many others, such as performance monitoring and fraud detection. Along the way, we realized that Spark DataFrames not only process information faster but are also more flexible to work with. One can write Hive-like queries, Pig-like code, UDFs, UDAFs, Python-like code, etc., all in the same place very easily, and can build DataFrame templates that multiple teams can use and reuse effectively. We believe that, implemented correctly, Spark DataFrames can potentially replace Hive/Pig in the big data space and have the potential to become a unified data language.
We conclude that Spark Streaming and DataFrames are the key to processing extremely large streams of data in real time with ease of use.
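The anomaly-detection idea in the abstract reduces to flagging values that deviate too far from typical behaviour. A toy z-score version in plain Python (the talk's pipeline uses Spark DataFrames; the data and threshold here are invented for illustration):

```python
from statistics import mean, stdev

def anomalies(values, threshold=2.0):
    """Flag points whose z-score against the series exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    return [x for x in values if abs(x - mu) / sigma > threshold]

# toy daily query counts for one search term; the spike is the anomaly
counts = [100, 98, 103, 101, 99, 102, 100, 500]
print(anomalies(counts))  # [500]
```

In a DataFrame setting the same test becomes a windowed aggregation plus a filter, which is what makes it easy to template and reuse across teams.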
A machine learning and data science pipeline for real companies – DataWorks Summit
Comcast is one of the largest cable and telecommunications providers in the country built on decades of mergers, acquisitions, and subscriber growth. The success of our company depends on keeping our customers happy and how quickly we can pivot with changing trends and new technologies. Data abounds within our internal data centers and edge networks as well as both the private and public cloud across multiple vendors.
Within such an environment and given such challenges, how do we build AI, machine learning, and data science platforms so our company can respond to the market, predict our customers’ needs, and create new revenue-generating products that delight our customers? If you don’t happen to be our friends and colleagues at Google, Facebook, or Amazon, what technologies, strategies, and toolkits can you employ to bring together disparate data sets and quickly get them into the hands of your data scientists, and then into your own production systems for use by your customers and business partners?
We’ll explore our journey and evolution and look at specific technologies and decisions that have gotten us to where we are today and demo how our platform works.
Speaker
Ray Harrison, Comcast, Enterprise Architect
Prashant Khanolkar, Comcast, Principal Architect Big Data
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud – Amazon Web Services
FINRA’s Data Lake unlocks the value in its data to accelerate analytics and machine learning at scale. FINRA's Technology group has changed its customer's relationship with data by creating a Managed Data Lake that enables discovery on Petabytes of capital markets data, while saving time and money over traditional analytics solutions. FINRA’s Managed Data Lake includes a centralized data catalog and separates storage from compute, allowing users to query from petabytes of data in seconds. Learn how FINRA uses Spot instances and services such as Amazon S3, Amazon EMR, Amazon Redshift, and AWS Lambda to provide the 'right tool for the right job' at each step in the data processing pipeline. All of this is done while meeting FINRA’s security and compliance responsibilities as a financial regulator.
"A session in the DevNet Zone at Cisco Live, Berlin. Analytics of network telemetry data (such as flow records, IPSLA measurements, and time series of MIB data) helps address many important operational problems. Traditional Big Data approaches run into limitations even as they push scale boundaries for processing data further. One reason for this is the fact that in many cases, the bottleneck for analytics is not analytics processing itself but the generation and export of the data on which analytics depends. Data does not come for free. The amount of data that can be reasonably collected from the network runs into inherent limitations due to bandwidth and processing constraints in the network itself. In addition, management tasks related to determining and configuring which data to generate lead to significant deployment challenges.
This presentation provides an overview of DNA (Distributed Network Analytics), a novel technology to analyze network telemetry data in distributed fashion at the network edge, allowing users to detect changes, predict trends, recognize anomalies, and identify hotspots in their network. Analytics processing occurs at the source of the data using an embedded DNA Agent App that dynamically configures data sources as needed and analyzes the data using an embedded analytics engine. This provides DNA with superior scaling characteristics while avoiding the significant operational and bandwidth overhead that is associated with centralized analytics solutions. An ODL-based SDN controller application orchestrates network analytics tasks across the network, providing a network analytics service that allows users to interact with the network as a whole instead of individual devices one at a time. DNA is enabled by the IOx App Hosting Framework and integrated with light-weight embedded analytics engines, CSA (Connected Service Analytics) and DMO (Data in Motion). "
Deep dive into spark streaming, topics include:
1. Spark Streaming Introduction
2. Computing Model in Spark Streaming
3. System Model & Architecture
4. Fault-tolerance, Check pointing
5. Comb on Spark Streaming
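Item 2 above, the computing model: Spark Streaming discretizes a continuous stream into micro-batches and runs a batch computation on each, carrying state across batches. A toy single-process simulation of that idea (plain Python, not the Spark API; names invented):

```python
def micro_batches(events, batch_size):
    """Discretize a stream into fixed-size micro-batches, like a DStream."""
    batch = []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def run(events, batch_size):
    state = {}  # running word counts, carried across batches (cf. updateStateByKey)
    for batch in micro_batches(events, batch_size):
        for word in batch:  # the per-batch "job"
            state[word] = state.get(word, 0) + 1
    return state

print(run(["a", "b", "a", "c", "a"], batch_size=2))  # {'a': 3, 'b': 1, 'c': 1}
```

Checkpointing (item 4) amounts to periodically persisting `state` and the stream offset so the loop can resume after a failure.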
WSO2Con ASIA 2016: Patterns for Deploying Analytics in the Real World – WSO2
Abundant data is all around us. What matters most is how you as an organization can access that data, process it, and present information to the relevant people on time. To gain competitive advantage, the means of accessing, processing and presenting the data should be optimal, highly available and scalable.
In this talk, we will discuss how you can leverage WSO2 Data Analytics Server, WSO2 IoT Server, WSO2 Enterprise Service Bus and other WSO2 products in order to analyze the data. We will also discuss different deployment patterns that can provide you with a suitable solution that lets you analyze relevant data historically, in real-time or interactively and predict future states to make better decisions for your organization’s success.
Presented a hands-on session on “Introduction to Big Data Analysis” at Dayananda Sagar University. 150+ university students benefited from this session.
From Pipelines to Refineries: scaling big data applications with Tim Hunter – Databricks
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need of rapid iterations. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse.
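The "more laziness plus auto-caching" idea in the abstract can be illustrated with a tiny pipeline type – a pure-Python sketch of the concept, not the framework from the talk: transformations are recorded rather than executed, so the whole chain exists before any work runs, and results are cached on first materialization.

```python
class Lazy:
    """Record transformations; execute (and cache) only on collect()."""
    def __init__(self, data, steps=()):
        self._data, self._steps, self._cache = data, steps, None

    def map(self, fn):
        return Lazy(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        return Lazy(self._data, self._steps + (("filter", pred),))

    def collect(self):
        if self._cache is None:              # auto-cache on first materialization
            out = list(self._data)
            for kind, fn in self._steps:
                out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
            self._cache = out
        return self._cache

pipeline = Lazy(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(pipeline.collect())  # [0, 4, 16, 36, 64] – nothing ran until collect()
```

Because the plan is inert data until `collect()`, whole-program checks and computation reuse become possible over the full pipeline rather than step by step.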
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ... – Precisely
Tackling the challenge of designing a machine learning model and putting it into production is the key to getting value back – and the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering robust production data pipelines has its own set of challenges. Syncsort software helps the data engineer every step of the way.
Building on the process of finding and matching duplicates to resolve entities, the next step is to set up a continuous streaming flow of data from data sources so that as the sources change, new data automatically gets pushed through the same transformation and cleansing data flow – into the arms of machine learning models.
Some of your sources may already be streaming, but the rest are sitting in transactional databases that change hundreds or thousands of times a day. The challenge is that you can’t affect the performance of data sources that run key applications, so putting something like database triggers in place is not the best idea. Using Apache Kafka or similar technologies as the backbone for moving data around doesn’t by itself solve the problem: you still need to grab changes from the source, push them into Kafka, and consume the data from Kafka for processing. And if something unexpected happens – like connectivity being lost on either the source or the target side – you don’t want to have to fix it or start over because the data is out of sync.
View this 15-minute webcast on-demand to learn how to tackle these challenges in large scale production implementations.
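One common answer to the "can't touch the source" constraint described above is timestamp-based polling with a persisted offset, so the capture job can resume after a failure without re-reading everything. A toy sketch (plain Python with an invented in-memory table; real deployments use CDC tooling and Kafka):

```python
def capture_changes(table, offset):
    """Pull only rows changed since the stored offset; return the new offset too."""
    changed = [row for row in table if row["ts"] > offset]
    new_offset = max((row["ts"] for row in changed), default=offset)
    return changed, new_offset

table = [
    {"id": 1, "ts": 10, "value": "a"},
    {"id": 2, "ts": 20, "value": "b"},
    {"id": 3, "ts": 30, "value": "c"},
]
batch1, offset = capture_changes(table, offset=0)    # first run: everything
table.append({"id": 4, "ts": 40, "value": "d"})      # source keeps changing
batch2, offset = capture_changes(table, offset)      # next run: only the delta
print(len(batch1), [r["id"] for r in batch2])  # 3 [4]
```

Persisting `offset` alongside the delivered batch is what makes the flow restartable without the source and target falling out of sync.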
Neuro-symbolic is not enough, we need neuro-*semantic* – Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply applying machine learning to just any symbolic structure is not sufficient to really harvest the gains of NeSy. Those gains will only be realized when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
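As a minimal, hedged illustration of "predictable inference" over a knowledge graph (toy triples invented for the example): if the semantics declares a relation transitive, new links follow predictably from the existing ones.

```python
def predict_links(triples, transitive_relations):
    """Infer new triples by applying declared transitivity until fixpoint."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(facts):
            for (c, r2, d) in list(facts):
                if (r1 == r2 and r1 in transitive_relations
                        and b == c and (a, r1, d) not in facts):
                    facts.add((a, r1, d))
                    changed = True
    return facts - set(triples)   # only the predicted links

kg = [("lyon", "locatedIn", "france"), ("france", "locatedIn", "europe")]
print(predict_links(kg, {"locatedIn"}))  # {('lyon', 'locatedIn', 'europe')}
```

The point of the talk, as I read it, is that a learner exploiting structure like this gains from the *semantics* (the transitivity declaration), not from the symbols alone.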
UiPath Test Automation using UiPath Test Suite series, part 3 – DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
DevOps and Testing slides at DASA Connect – Kari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We closed with a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Connector Corner: Automate dynamic content and events by pushing a button – DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Accelerate your Kubernetes clusters with Varnish Caching – Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
PHP Frameworks: I want to break free (IPC Berlin 2024) – Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
2. About me
• Padma Chitturi
• Analytics Lead @ Fractal Analytics
• Author of “Apache Spark for Data Science Cookbook”
• https://github.com/ChitturiPadma/SparkforDataScienceCookbook
4. Need for “Real Time” Analytics across Industries
• Fraud detection
• Connected car data
• Identity & protection services
• Click stream analysis
• Financial sales tracking
• Improving patient care
5. Overview of Spark
• In-memory cluster computing framework for processing and analyzing large volumes of data.
• Key Features:
• Easy to use (expressive API for batch & real-time processing)
• Fast (provides in-memory persistence and optimizes disk seeks)
• General-purpose (supports batch, real-time and graph processing)
• Scalable (as the data grows, computational power can be increased by adding more nodes)
• Fault-tolerant (handles node failures without interrupting the application by launching tasks on nodes holding a replicated copy)
6. What is Spark Streaming?
• Extends Spark for large-scale stream processing
• Scales to hundreds of nodes and achieves second-scale latencies
• Efficient and fault-tolerant stateful stream processing
• Integrates with Spark’s batch and interactive processing
• Provides a simple batch-like API for implementing complex algorithms
7. Discretized Stream Processing
• Run a streaming computation as a series of very small, deterministic batch jobs:
• Chop up the live data stream into batches of X seconds
• Spark treats each batch of data as an RDD and processes it using RDD operations
• Finally, the processed results of the RDD operations are returned in batches
• Batch sizes can be as low as ½ second, with latency of about 1 second
• Potential for combining batch processing and stream processing in the same system
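The micro-batch model above can be sketched without Spark in plain Python; the batch size and the word-count "job" are illustrative assumptions, not part of the slides:

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Chop a live stream (any iterator) into small batches,
    mimicking how Spark Streaming discretizes a stream into RDDs."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        yield batch

def process_batch(batch):
    """A deterministic 'batch job': count words in one micro-batch."""
    counts = {}
    for record in batch:
        for word in record.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# Simulated live stream of text records, processed batch by batch
stream = ["spark streaming", "spark rdd", "streaming rdd", "spark"]
results = [process_batch(b) for b in micro_batches(stream, batch_size=2)]
```

In Spark Streaming the chopping and per-batch scheduling are done by the framework; only the per-batch logic is user code.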
8. Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
• The tweets DStream pulls data from the Twitter Streaming API
• Each batch (batch @ t, batch @ t+1, batch @ t+2, …) is stored in memory as an RDD (immutable, distributed dataset)
9. Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
• flatMap is a transformation: it modifies the data in one DStream to create another DStream
• flatMap is applied to every batch of the tweets DStream, so new RDDs (e.g. [#cat, #dog, …]) are created for every batch of the hashTags DStream
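The slides do not show the body of getTags; a minimal hypothetical stand-in, with flatMap emulated over micro-batches in plain Python:

```python
def get_tags(status):
    """Extract hashtags from a tweet's text (hypothetical stand-in
    for the getTags helper used on the slide)."""
    return [w for w in status.split() if w.startswith("#")]

def flat_map(batches, f):
    """flatMap over a discretized stream: apply f to every record in
    every batch and flatten the results, one output batch per input batch."""
    return [[tag for status in batch for tag in f(status)] for batch in batches]

# Two micro-batches of simulated tweets
tweets = [["I love my #cat", "walking the #dog"], ["#spark streaming rocks"]]
hash_tags = flat_map(tweets, get_tags)
```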
10. Example 2 – Count the hashtags over the last 1 minute
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
• window is a sliding window operation over the DStream, parameterized by a window length (Minutes(1)) and a sliding interval (Seconds(1))
11. Example 2 – Count the hashtags over the last 1 minute
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
• At each sliding interval (t-1, t, t+1, t+2, t+3, …), countByValue counts over all the data in the window of the hashTags DStream to produce the tagCounts DStream
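The window + countByValue combination can be sketched without Spark; measuring window length and slide in whole batches (rather than seconds) is an assumption made for this sketch:

```python
from collections import Counter

def windowed_counts(batches, window_len, slide):
    """Sliding-window count over a discretized stream: every `slide`
    batches, count values across the last `window_len` batches."""
    results = []
    for end in range(window_len, len(batches) + 1, slide):
        window = batches[end - window_len:end]
        results.append(Counter(tag for batch in window for tag in batch))
    return results

# Four micro-batches of hashtags; window of 2 batches, sliding by 1
batches = [["#cat"], ["#dog", "#cat"], ["#cat"], ["#spark"]]
counts = windowed_counts(batches, window_len=2, slide=1)
```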
12. Fault-tolerance
• RDDs remember the operations that created them
• Batches of input data are replicated in memory for fault-tolerance
• Data lost due to worker failure can be recomputed from the replicated input data: lost partitions of the hashTags RDD are recomputed on other workers by re-applying flatMap to the tweets RDD
• Therefore, all transformed data is fault-tolerant
13. Fault-tolerance
• A Spark Streaming program is represented internally as a DStream Graph
t = ssc.twitterStream(“…”)
 .map(…)
t.foreach(…)
• This builds a graph of a Twitter Input DStream, a Mapped DStream, and a Foreach DStream (a dummy DStream signifying an output operation)
t1 = ssc.twitterStream(“…”)
t2 = ssc.twitterStream(“…”)
t = t1.union(t2).map(…)
t.saveAsHadoopFiles(…)
t.map(…).foreach(…)
t.filter(…).foreach(…)
• This builds a larger graph: two Twitter Input DStreams feeding a union, a Mapped DStream, and three output branches
14. DStream Graph -> RDD Graphs -> Spark Jobs
• Every batch interval, an RDD graph is computed from the DStream graph (Block RDDs hold the data received in the last batch interval)
• For each output operation, a Spark action is created
• For each action, a Spark job is created to compute it (the graph on the previous slide, with its three output branches, yields 3 Spark jobs)
15. Execution Model – Job Scheduling
• On the Spark Streaming + Spark driver: the Network Input Tracker tracks the block IDs of received data, the Job Scheduler turns the DStream Graph into jobs every batch interval, and the Job Manager queues the jobs (Job Queue) and submits them to Spark’s schedulers
• On the Spark workers: the Block Manager stores the received data, and the jobs are executed on the worker nodes
16. RDD Checkpointing
• Saving an RDD to HDFS prevents the RDD graph from growing too large
• Done internally by Spark, transparent to the user program
• Done lazily: saved to HDFS the first time it is computed
red_rdd.checkpoint()
• The contents of red_rdd are saved to an HDFS file, transparently to all child RDDs
17. RDD Checkpointing
• Stateful DStream operators can have infinite lineages: at each interval (t-1, t, t+1, t+2, t+3, …) the new state RDD depends on the batch’s data RDD and on all previous states
• Large lineages lead to:
• Large closure of the RDD object -> large task sizes -> high task launch times
• High recovery times under failure
• Periodic RDD checkpointing (to HDFS) solves this
• Useful for iterative Spark programs as well
18. Performance Tuning
• Increase read parallelism
• Increase downstream processing parallelism
• Achieve a stable configuration that can sustain the streaming workload
• Optimize for low latency
• Tune memory settings and explore GC options
• Achieve fault-tolerance
• Serialize the objects
19. Analytics transforms the business
• Stages of value: Sharpen the saw -> Support strategic decisions -> Achieve breakthrough innovation -> Disruption
• Data: Observe everything; Fuse external data; Leverage unstructured data
• Sophistication: Incorporate a “feedback” loop; Explore AI; Leverage unsupervised methods
• Institutionalization: Build a data-driven culture; Do systematic experimentation; Forge a multidisciplinary team
• Real time: Operationalize decisions; Reduce decision latency; Increase contextual relevance
21. Machine Learning
• Deals with the “construction and study of systems that can learn from data”
• Seen as a building block for making computers behave more intelligently
• Two phases in the learning process – training & testing
• Two kinds of learning:
• Unsupervised
• No labels in the training data
• The algorithm detects patterns in the data and groups observations with similar characteristics together
• Supervised
• We have training data with correct labels
• Use the training data to train the algorithm
• Then apply it to data without a correct label
22. Some types of algorithms
• Prediction
• Predicting a variable from data
• Classification
• Assigning observations to pre-defined classes
• Clustering
• Splitting observations into groups based on similarity
• Recommendation
• Predicts what people might like & uncovers relationships between items
23. Steps in an Analytics Workload
• Data collection
• Pre-processing the data (cleaning & data munging)
• Retrieve sample data from the actual population
• Descriptive statistics on the sample data
• Exploratory Data Analysis with Spark
• Univariate analysis
• Bivariate analysis
• Missing value treatment
• Outlier detection
• Feature engineering
• Apply machine learning models
• Optimize and fine-tune the model parameters
24. Sample Data
Types of labels:
• Denial of service (DoS – attack type)
• Normal
• Probe (attack)
• R2L (attack)
• U2R (attack)
25. Unsupervised Learning - Clustering
• Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense
• Find areas dense with data (and also areas without data)
• Anomaly – a point far from any cluster
• Supervise with labels to improve and interpret the clusters
26. Streaming K-means (K-means++)
• Assign points to the nearest center, update centers, iterate
• Goal: points close to their nearest cluster center
• Must choose k = number of clusters
• “++” means a smarter choice of starting centers
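One k-means iteration (assign each point to its nearest center, then update the centers) can be sketched in plain Python; the 1-D points and k = 2 are illustrative assumptions:

```python
def kmeans_step(points, centers):
    """One iteration of k-means: assign each point to its nearest
    center, then move each center to the mean of its assigned points."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # Keep a center in place if no points were assigned to it
    return [sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)]

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]  # two obvious groups
centers = [0.0, 5.0]
for _ in range(5):  # iterate until (approximately) converged
    centers = kmeans_step(points, centers)
```

K-means++ differs only in how the initial centers are chosen (spread out proportionally to distance from already-chosen centers) rather than in the iteration itself.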
27. Clustering – choosing parameters
• Initially plot a t-SNE plot, which shows the distribution of the data in 2 dimensions; it helps to identify whether the data can be clustered
• Normalize the data before applying k-means, i.e. standardize the scores
• Choose the k value using the elbow method or using PCA analysis
• Convert categorical variables to numeric using one-hot encoding
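The two pre-processing steps named above (standardizing scores and one-hot encoding categoricals) can be sketched in plain Python; the feature values are illustrative:

```python
def standardize(values):
    """Standardize scores: subtract the mean, divide by the std deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def one_hot(values):
    """One-hot encode a categorical column into 0/1 indicator vectors."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical network-data columns: a numeric duration and a protocol
durations = standardize([2.0, 4.0, 6.0])
protocols = one_hot(["tcp", "udp", "tcp"])
```

Standardizing matters for k-means because its distance computation would otherwise be dominated by whichever feature has the largest raw scale.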
28. Streaming k-means
Approach:
• Start with k cluster centres initially
• For every incoming batch of data, the centroids keep updating
• The clusters drift over time and, after a certain stage, they stabilize
• Continuously learns new data patterns
• Outliers are detected as anomalies
Pros:
• More useful when the data points don’t have labels associated with them
• Simple to implement
Cons:
• Doesn’t fit high-dimensional data
Pipeline: network data -> Kafka -> Spark Streaming -> streaming k-means
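The per-batch centroid update can be sketched as a weighted average of the old centre and the new batch's points, loosely following a forgetful update rule of the kind used by MLlib's streaming k-means; the decay factor and 1-D data are illustrative assumptions:

```python
def update_center(center, weight, batch_points, decay=0.9):
    """Update one cluster centre with a new batch of (1-D) points:
    the old centre contributes its decayed point count as weight,
    the batch contributes its mean weighted by its size."""
    m = len(batch_points)
    if m == 0:
        return center, weight * decay  # old data slowly forgotten
    batch_mean = sum(batch_points) / m
    new_weight = weight * decay + m
    new_center = (center * weight * decay + batch_mean * m) / new_weight
    return new_center, new_weight

# A centre at 5.0 backed by 10 points drifts toward a new batch near 9
center, weight = 5.0, 10.0
center, weight = update_center(center, weight, [8.0, 9.0, 10.0])
```

With decay < 1, old batches are gradually forgotten, which is what lets the clusters drift as the slide describes.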
29. “Offline” vs “Online” algorithms
Offline Learning:
• Build models on static data
• Train algorithms on “batches” of data
• Use the model to make predictions on the incoming data stream
• Pros:
• Easy to analyze
• High accuracy
• Batch algorithms are quite accessible
• Cons:
• Unable to identify dynamic patterns
Online Learning:
• Build the model on a live stream of data
• Training happens continuously on live data
• Use the model to both predict and learn on streaming data
• Pros:
• Model evolves continuously
• Identifies rapidly changing patterns in the data
• Cons:
• Streaming algorithms are not widely available; still an active area of research
30. Streaming SVM
• In machine learning, support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis
• Capable of reflecting changes in the dataset in real time
• SVM is resistant to noise
• It uses a high-dimensional space to separate the dataset
• The prediction rate can be increased by scaling the Spark cluster