This document summarizes a lecture on distributed computing and Apache Spark. It discusses the challenges of processing large datasets, including machine failures and network communication costs. It introduces MapReduce as a programming model for distributed computing using examples like word counting. Apache Spark is presented as an improvement over Hadoop by keeping data in-memory between iterations for faster processing of iterative algorithms like machine learning. The three main Spark APIs of RDDs, DataFrames, and Datasets are also summarized.
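The word-count example is easy to sketch outside a cluster; this is a minimal single-machine simulation of the map, shuffle, and reduce phases of MapReduce in Python (the sample documents are invented for illustration):

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
```

In a real cluster each phase runs on many machines at once; the point of the model is that the mapper and reducer never need to see the whole dataset.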
This document discusses techniques for approximating and sampling big data to enable scalable machine learning. It describes how taking a sample of data or computing an approximate mean can provide a close enough answer while using fewer resources. It also discusses pruning techniques like discarding low similarity item pairs or limiting the number of preferences per user to reduce the size of similarity matrices when computing recommendations. The goal is to cheat by finding approximations that are not exactly right but close enough to enable processing very large datasets within resource constraints.
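As a sketch of the sampling idea, the following (with invented data) compares an exact mean over a million values against the mean of a 1% uniform sample:

```python
import random

random.seed(42)
# A stand-in "big" dataset: one million values.
data = [random.gauss(50, 10) for _ in range(1_000_000)]

# The exact mean touches every record...
exact = sum(data) / len(data)

# ...while a 1% uniform sample gets close enough at a fraction of the cost.
sample = random.sample(data, 10_000)
approx = sum(sample) / len(sample)

# The sample mean typically lands well within one unit of the true mean.
print(abs(exact - approx))
```

The standard error of a 10,000-element sample here is about 10 / sqrt(10,000) = 0.1, which is the quantitative sense in which the "cheat" is close enough.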
This document discusses why distributed computing is necessary for large datasets and complex queries. It argues that while a single computer may be able to handle datasets of billions of records for simple queries, more complex queries or larger datasets require distributed computing. Distributed systems can leverage multiple machines to parallelize work, handle hardware failures without losing entire datasets, and scale beyond the memory and processing limits of individual machines. The document uses examples of large census and web crawling datasets to illustrate how requirements quickly exceed what is possible on a single computer.
This document provides an introduction to computers and binary numbering. It explains that computers operate using binary, which has only two digits (0 and 1) compared to the human decimal system which uses base 10. Binary is used because computer circuits can only be in two states, on or off. The document gives examples of counting in binary and converting numbers between decimal and binary. It also discusses memes and provides chat acronyms and their meanings. Students are assigned a group project to present on a meme, explaining its four phases of spread.
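The decimal/binary conversions the document walks through follow the usual repeated-division and positional-value procedures, sketched here in Python:

```python
def to_binary(n):
    # Repeatedly divide by 2 and collect remainders, least significant bit first.
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        bits.append(str(n % 2))
        n //= 2
    return "".join(reversed(bits))

def to_decimal(bits):
    # Each binary digit is a power of 2, just as decimal digits are powers of 10.
    value = 0
    for bit in bits:
        value = value * 2 + int(bit)
    return value

print(to_binary(13))       # 1101
print(to_decimal("1101"))  # 13
```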
Building Stuff for Fun and Profit - confessions from a life in code and cables - Holly Cummins
I love making stuff. I'm so happy that my job allows me to make stuff, and when I'm not at work, I'm making stuff anyway. Some of the stuff I've made has solved real technical and business problems; some of it I've done just to see if I can. In this talk I'll describe some of the valuable things I've built for my employer, IBM, and our clients - I'll also describe some of the ridiculous things I've made for myself.
These are slides for a talk given at BuildStuff Odessa, 2016 (http://www.buildstuff.com.ua/odessa/)
This document summarizes challenges in assembling large DNA sequence data sets and strategies to address them.
1. The cost to generate DNA sequence data is decreasing rapidly, creating data sets too large for most computers to assemble. Hundreds to thousands of such data sets are generated each year.
2. Techniques like streaming compression and low-memory probabilistic data structures allow assembly memory usage to scale linearly with the sample size rather than the total data, enabling assembly of larger datasets.
3. Benchmarking different computational platforms revealed that while some platforms have faster processors, the ability to store large amounts of data locally is also important for assembly tasks. Scaling algorithms, rather than just optimizing code, is key to addressing these data challenges.
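One way to picture the low-memory probabilistic idea is a toy count-min sketch for k-mer counting: memory is fixed by the table size rather than by the number of distinct k-mers, and estimates can only overcount. The reads and parameters below are invented, and real assembly tools use far more efficient structures; this is just a sketch of the principle:

```python
import hashlib

class CountMinSketch:
    """Toy count-min sketch: fixed memory, approximate counts that never undercount."""
    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One independent hash per row, derived from a salted SHA-256 digest.
        for i in range(self.depth):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.width

    def add(self, item):
        for row, slot in enumerate(self._slots(item)):
            self.tables[row][slot] += 1

    def count(self, item):
        # The minimum over rows bounds the true count; collisions only inflate it.
        return min(self.tables[row][slot] for row, slot in enumerate(self._slots(item)))

def kmers(read, k=4):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

sketch = CountMinSketch()
for read in ["ACGTACGT", "ACGTTTTT"]:
    for kmer in kmers(read):
        sketch.add(kmer)

print(sketch.count("ACGT"))  # at least 3: twice in the first read, once in the second
```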
Raspberry Pi: Learn Raspberry Pi with Linux, by Peter Membrey and David Hows - SANTIAGO PABLO ALBERTO
This document provides an introduction to the Raspberry Pi and discusses the motivation behind its creation. It notes that fewer people today have an in-depth understanding of computer hardware and how software interacts with it. The Raspberry Pi aims to address this by being an accessible yet fully functional computer that gives users insight into its inner workings.
Here are the key points about memory management challenges and algorithms for traditional memory mapping:
- Fragmentation is a major challenge - both internal and external fragmentation can occur when memory is allocated and freed over time in a non-contiguous manner. This wastes memory.
- First-fit algorithm - allocates memory in the first partition that fits the request. Fast but can lead to internal fragmentation.
- Best-fit algorithm - allocates memory in the smallest partition that fits the request. Less internal fragmentation but slower than first-fit.
- Worst-fit algorithm - allocates memory in the largest partition that fits the request. Least internal fragmentation but slowest.
- Compaction techniques can be used to consolidate scattered free memory into a single contiguous block, eliminating external fragmentation, though the copying involved makes compaction expensive.
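The three placement policies above can be sketched in a few lines of Python (the free-partition sizes are invented):

```python
def allocate(partitions, request, strategy):
    """Pick a free partition index for `request` under first/best/worst fit.

    `partitions` is a list of free partition sizes; returns an index or None.
    """
    candidates = [(size, i) for i, size in enumerate(partitions) if size >= request]
    if not candidates:
        return None
    if strategy == "first":
        return min(candidates, key=lambda c: c[1])[1]  # earliest partition that fits
    if strategy == "best":
        return min(candidates)[1]                      # smallest partition that fits
    if strategy == "worst":
        return max(candidates)[1]                      # largest partition that fits
    raise ValueError(f"unknown strategy: {strategy}")

free = [100, 500, 200, 300, 600]
print(allocate(free, 212, "first"))  # 1 (500 is the first partition that fits)
print(allocate(free, 212, "best"))   # 3 (300 leaves the least slack)
print(allocate(free, 212, "worst"))  # 4 (600 leaves the largest usable remainder)
```

The trade-offs in the list fall straight out of the code: first-fit stops at the earliest candidate, while best-fit and worst-fit must examine every free partition.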
Outrageous ideas for Graph Databases
Almost every graph database vendor raised money in 2021. I am glad they did, because they are going to need the money. Our current Graph Databases are terrible and need a lot of work. There I said it. It's the ugly truth in our little niche industry. That's why despite waiting for over a decade for the "Year of the Graph" to come we still haven't set the world on fire. Graph databases can be painfully slow, they can't handle non-graph workloads, their APIs are clunky, their query languages are either hard to learn or hard to scale. Most graph projects require expert shepherding to succeed. 80% of the work takes 20% of the time, but that last 20% takes forever. The graph database vendors optimize for new users, not grizzled veterans. They optimize for sales not solutions. Come listen to a Rant by an industry OG on where we could go from here if we took the time to listen to the users that haven't given up on us yet.
This document summarizes an introductory session on programming in the digital humanities. It discusses how programming involves complex work in figuring out what to do and which languages to use. Examples are provided of tasks a programming language could perform on text data, like finding quotes from a novel or allowing a user to search a text file. The document emphasizes that critical thinking is important to programming in the humanities. It also discusses different ways of structuring data, such as with markup languages like HTML and TEI, or in a structured format like a database. The goal is to make data understandable to computers while retaining its usefulness. Collaboration is important when creating structured data.
This document provides an overview of a masterclass on big data presented by Prof.dr.ir. Arjen P. de Vries. It discusses defining properties of big data, challenges in big data analytics including capturing, aligning, transforming, modeling and understanding large datasets. It also briefly introduces map-reduce and streaming data analysis. Examples of large datasets that could be analyzed are provided, such as the sizes of datasets from Facebook, Google and other organizations.
MBA632 Lecture, Morehead State University - Brad Ward
Slide deck from a guest lecture I did for MBA632 at Morehead State University. Use http://linkbun.ch links to see the specific pages I went over for the screenshare.
The speaker discusses how they used Terraform to improve their workflow for data science projects. As a data scientist, they spent most of their time dealing with infrastructure issues rather than the data science work. Terraform's "infrastructure as code" approach allowed them to define and provision resources like servers and databases in a declarative way. This improved reproducibility and made it easier to set up and destroy resources for experiments. Modules also helped abstract complexity and allowed resources to be composed together. The speaker argues this approach can benefit both data scientists and DevOps teams by making infrastructure part of the reproducible workflow.
1. The document discusses the history of computers through their generations of development.
2. It describes the generations of development, from the first generation's vacuum tubes to the current generation's microprocessors.
3. The key developments included the transition from vacuum tubes to transistors to integrated circuits and finally microprocessors, which made computers smaller, faster, cheaper and more powerful with each generation.
Supercomputers are the most powerful computers in the world capable of performing trillions of calculations per second. They use parallel processing, which splits problems into pieces to be solved simultaneously by thousands of processors. While early supercomputers were single, massive machines, modern supercomputers often consist of clusters of networked computers working together. Supercomputers are used for advanced scientific and engineering applications like weather modeling, nuclear research, and animations that regular computers cannot handle due to their immense processing power needs.
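The split-and-combine pattern behind parallel processing can be illustrated with a chunked sum. This toy uses a thread pool on a single machine, whereas a supercomputer distributes the pieces across thousands of processors, but the structure of the decomposition is the same:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker handles one piece of the problem independently.
    return sum(chunk)

numbers = list(range(1_000_001))  # 0..1,000,000; the true sum is 500000500000

# Split the problem into pieces...
chunks = [numbers[i:i + 250_000] for i in range(0, len(numbers), 250_000)]

# ...solve the pieces concurrently, then combine the partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))

print(total)  # 500000500000
```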
The document discusses various computer terms and technologies. It provides definitions for common computer hardware and software components like the personal computer, laptop, hardware, software, and hard disk. It also defines Internet-related terms such as the World Wide Web, website, newsgroup, chatroom, and FAQ. The document contains a glossary of computer and Internet terms.
Many companies are looking for "DevOps" in many forms, but what kind of skills or experiences are actually needed? I'll debunk some of the myths surrounding what recruiters or internet lurkers might tell you and find out if you might actually have an aptitude for Site Reliability or Infrastructure Engineering. If so, what might be good knowledge areas to get started with? And if learning leads to an interview, what might that look like?
Have you seen the gear people strap on every day to / from work? Look around the next time you are in the office or on the train. Do we really need these big bags full of I-hope-I-get-to-this stuff?
The more we tout lighter / smaller laptops and tablets, and more powerful phones, the more stuff we seem to be carrying, and the less mobile we are becoming.
Technology is here to help us. Let's start using it a little more wisely.
CT Brown - Doing next-gen sequencing analysis in the cloud - Jan Aerts
This document summarizes work on digital normalization, a technique for reducing sequencing data size prior to assembly. Digital normalization works by discarding reads whose median k-mer abundance already exceeds a coverage cutoff, based on analysis of k-mer abundances across the dataset. It can remove over 95% of data in a single pass with fixed memory. This makes genome and metagenome assembly scalable to larger datasets using cloud computing resources. The work is done in an open science manner, with all code, data, and manuscripts openly accessible online.
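A toy version of the normalization rule, with invented reads and parameters (real digital normalization uses probabilistic counting rather than an exact dictionary): a read is kept only while the median count of its k-mers is still below the coverage cutoff, so redundant high-coverage reads get discarded while novel reads survive.

```python
from collections import Counter

def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def digital_normalization(reads, k=4, cutoff=2):
    """Keep a read only if the median count of its k-mers is below the cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        kms = kmers(read, k)
        median = sorted(counts[km] for km in kms)[len(kms) // 2]
        if median < cutoff:
            kept.append(read)
            counts.update(kms)  # only kept reads contribute to the counts
    return kept

# Five identical reads plus one novel read: the redundant copies are dropped.
reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
kept = digital_normalization(reads)
print(len(kept))  # 3: two copies saturate the cutoff, plus the novel read
```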
Talk at Bioinformatics Open Source Conference, 2012 - c.titus.brown
This document summarizes work on digital normalization, a technique for reducing sequencing data size prior to assembly. Digital normalization works by discarding reads whose median k-mer abundance already exceeds a coverage cutoff, based on analysis of k-mer frequencies in the de Bruijn graph. It can remove over 95% of data in a single pass with fixed memory. Digital normalization enables assembly of large datasets in the cloud by reducing data size and memory requirements. The document acknowledges collaborators and funding sources and provides links for code, blogs, papers, and future events.
The document provides tips on common scalability mistakes made when designing web applications and strategies to avoid them. It discusses the importance of considering scalability from the beginning, avoiding blocking calls, caching frequently accessed data, optimizing database and file system usage, and using tools like profilers to identify bottlenecks. The key is designing applications that can scale both up and down based on current needs through a proactive, process-oriented approach.
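The caching advice is easy to demonstrate with Python's built-in memoization; the slow lookup here is an invented stand-in for a database or file-system call:

```python
import time
from functools import lru_cache

@lru_cache(maxsize=1024)
def expensive_lookup(key):
    # Stand-in for a slow database or file-system read.
    time.sleep(0.01)
    return key.upper()

start = time.perf_counter()
expensive_lookup("user:42")   # cold: pays the full cost
cold = time.perf_counter() - start

start = time.perf_counter()
expensive_lookup("user:42")   # warm: served straight from the cache
warm = time.perf_counter() - start

print(warm < cold)  # the cache hit is orders of magnitude cheaper
```

The same pattern (bounded cache in front of a slow backend) is what the "cache frequently accessed data" tip amounts to, whether the cache is in-process like this or external like Redis.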
The Naturalization Civics Test helps immigrants become United States citizens by assessing their knowledge of American history and government. The test consists of 10 questions, and applicants need to correctly answer only 6 questions to pass. While the test was originally intended to be difficult, it has become relatively easy to pass if applicants study American history. The test plays an important role in the naturalization process by providing a standardized way to assess applicants' understanding of the U.S., while allowing them to become citizens even if they do not answer all questions correctly.
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E... - Data Con LA
This talk will address one apparently simple question: which open source machine learning tools can be used out of the box for binary classification (probably the most common machine learning problem) on largish datasets? We will therefore discuss the speed, scalability and accuracy of the top (most accurate) learning algorithms in R packages, Python's scikit-learn, H2O, xgboost and Spark's MLlib.
The document summarizes the Perl Conference 2019 that was hosted in Pittsburgh from May 13-17. It provides highlights from some of the sessions including talks on Perl 5 and its future, using tmux more efficiently, regular expressions, Rust, git, accessibility in programming, and more. Lightning talks covered topics such as Perl, pens, and even renaming the Perl language. The summary encourages attendees to watch videos from sessions they were unable to attend.
Munich 2016 - Z011597 Martin Packer - How To Be A Better Performance Specialist - Martin Packer
This document provides tips for performance specialists to improve their skills and become more valuable to their organizations. It recommends continuously learning new skills, gaining experience across different systems, engaging with industry communities, experimenting with new data visualization techniques, and keeping an innovative mindset. The goal is to add more value through creative problem solving while maintaining good relationships.
20 Comprehensive Checklist of Designing and Developing a Website - Pixlogix Infotech
Dive into the world of Website Designing and Developing with Pixlogix! Looking to create a stunning online presence? Look no further! Our comprehensive checklist covers everything you need to know to craft a website that stands out. From user-friendly design to seamless functionality, we've got you covered. Don't miss out on this invaluable resource! Check out our checklist now at Pixlogix and start your journey towards a captivating online presence today.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Similar to Lecture_2_-_Distributed_Computing.pdf (20)
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
Building RAG with self-deployed Milvus vector database and Snowpark Container...Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications.He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di più di tutto ciò in comune.
Partecipate alla presentazione per immergervi in una storia di interoperabilità, standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunità open source sostenibile.
BIO: Sostenitrice del software libero e dei formati standard e aperti. È stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
1. 10-605/805 – ML for
Large Datasets
Lecture 2: Distributed
Computing
Henry Chai
9/1/22
2. Front Matter
– HW1 released 8/30, due 9/13 at 11:59 PM
– For HW1 only, the programming part is optional (but
strongly encouraged)
– The written part is nominally about PCA but can be
solved using pre-requisite knowledge (linear algebra)
– Recitations on Friday, 11:50 – 1:10 (different from lecture)
in GHC 4401 (same as lecture)
– Recitation 1 on 9/2: Introduction to PySpark/Databricks
Henry Chai - 9/1/22 2
3. Recall:
Machine
Learning
with Large
Datasets
– Premise:
– There exists some pattern/behavior of interest
– The pattern/behavior is difficult to describe
– There is data (sometimes a lot of it!)
– More data usually helps
– Use data efficiently/intelligently to “learn” the pattern
– Definition:
– A computer program learns if its performance, P, at
some task, T, improves with experience, E.
Henry Chai - 9/1/22 3
4. Okay, but how
much data are
we talking
about here?
– Premise:
– There exists some pattern/behavior of interest
– The pattern/behavior is difficult to describe
– There is data (sometimes a lot of it!)
– More data usually helps
– Use data efficiently/intelligently to “learn” the pattern
– Definition:
– A computer program learns if its performance, P, at
some task, T, improves with experience, E.
Henry Chai - 9/1/22 4
5. Units of Data
Unit           Value        Scale
Kilobyte (KB)  1000 bytes   a paragraph of text
Megabyte (MB)  1000 KB      a short novel
Gigabyte (GB)  1000 MB      Beethoven's 5th symphony
Terabyte (TB)  1000 GB      all the x-rays in a large hospital
Petabyte (PB)  1000 TB      ≈ 1/2 of the content of all US research libraries
Exabyte (EB)   1000 PB      ≈ 1/5 of the words humans have ever spoken
Henry Chai - 9/1/22 5
6. Henry Chai - 9/1/22 6
Source: https://res.cloudinary.com/yumyoshojin/image/upload/v1/pdf/future-data-2019.pdf
7. The Big Data
Problem
– The sources and amount of data keep growing
– Data storage and processing can’t keep up
– A 1 TB hard disk costs ≈ $25 USD
– Reading 1 TB from disk ≈ 5 hours
– So it would cost $100,000 USD to store Facebook’s
4PB of data/day and take ≈ 2.5 years to read!
Henry Chai - 9/1/22 7
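The slide's claim can be checked with back-of-the-envelope arithmetic, using its own assumptions ($25 per 1 TB disk, ≈ 5 hours to read 1 TB):

```python
# Check the slide's claim about storing/reading Facebook's 4 PB/day,
# using the slide's assumptions: ~$25 per 1 TB disk, ~5 hours to read 1 TB.
COST_PER_TB_USD = 25
HOURS_PER_TB_READ = 5

daily_data_tb = 4000  # 4 PB/day = 4000 TB/day

storage_cost = daily_data_tb * COST_PER_TB_USD               # dollars per day of data
read_time_years = daily_data_tb * HOURS_PER_TB_READ / (24 * 365)

print(storage_cost)                # 100000
print(round(read_time_years, 1))   # ~2.3 years of serial reading (≈ 2.5 on the slide)
```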
8. Recall:
Parallel
Computing
– Multi-core processing – scale up one big machine
– Data can fit on one machine
– Usually requires high-end, specialized hardware
– Simpler algorithms that don’t necessarily scale well
– Distributed processing – scale out many machines
– Data stored across multiple machines
– Scales to massive problems on standard hardware
– Added complexity of network communication
Henry Chai - 9/1/22 8
9. – Multi-core processing – scale up one big machine
– Data can fit on one machine
– Usually requires high-end, specialized hardware
– Simpler algorithms that don’t necessarily scale well
– Distributed processing – scale out many machines
– Data stored across multiple machines
– Scales to massive problems on standard hardware
– Added complexity of network communication
Recall:
Parallel
Computing
Henry Chai - 9/1/22 9
10. Cloud
Computing
– Enables distributed processing by democratizing access
to storage and computational resources
– Costs continue to decrease each year
Henry Chai - 9/1/22 10
12. How can we
program in this
setting?
Henry Chai - 9/1/22 12
Source: https://www.google.com/about/datacenters/gallery/
13. Challenges in
Cloud
Computing
1. How can we divide work across multiple machines?
– Key consideration: moving data across
machines/over a network is costly!
2. What can we do if a machine fails?
– If a node fails every 3 years …
– … then a center with 10,000 nodes will have 10
faults/day!
– Even worse are stragglers, or nodes that run slower
than others
Henry Chai - 9/1/22 13
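The "10 faults/day" figure follows directly from the stated failure rate:

```python
# If each node fails on average once every 3 years, a 10,000-node
# data center sees roughly this many failures per day.
nodes = 10_000
mtbf_days = 3 * 365          # mean time between failures, per node, in days

faults_per_day = nodes / mtbf_days
print(round(faults_per_day, 1))  # ≈ 9.1, i.e., about 10 faults/day
```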
14. Example:
Word Counting
– Given a text, count the number of times each word appears
Henry Chai - 9/1/22 14
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
15. Example:
Word Counting
– Given a text, count the number of times each word appears
Henry Chai - 9/1/22 15
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
Word Count
I 1
16. Example:
Word Counting
– Given a text, count the number of times each word appears
Henry Chai - 9/1/22 16
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
Word Count
I 1
am 1
17. Example:
Word Counting
– Given a text, count the number of times each word appears
Henry Chai - 9/1/22 17
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
Word Count
I 1
am 1
Sam 1
18. Example:
Word Counting
– Given a text, count the number of times each word appears
Henry Chai - 9/1/22 18
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
Word Count
I 1
am 1
Sam 2
19. – Given a text, count the number of times each word appears
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
Example:
Word Counting
Henry Chai - 9/1/22 19
Word Count
I 6
am 5
Sam 5
that 3
do 1
not 1
like 1
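On a single machine, the counts in the table above are a one-liner with Python's standard library (a sanity check, not how the distributed version works):

```python
# Single-machine word count over the six-line excerpt shown on the slide.
from collections import Counter

text = """I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am"""

counts = Counter(text.split())
print(counts["I"], counts["am"], counts["Sam"])  # 6 5 5
```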
20. – Given a text, count the number of times each word appears
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
do you like
green eggs and ham
I do not like them
Sam I am
⋮
Example:
Word Counting
Henry Chai - 9/1/22 20
21. – Given a text, count the number of times each word appears
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
do you like
green eggs and ham
I do not like them
Sam I am
⋮
Example:
Word Counting
Henry Chai - 9/1/22 21
I 2
⋮ ⋮
that 2
⋮ ⋮
I 2
⋮ ⋮
do 1
⋮ ⋮
I 2
⋮ ⋮
Word Count
I 8
am 6
Sam 6
that 3
⋮ ⋮
22. – Given a text, count the number of times each word appears
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
do you like
green eggs and ham
I do not like them
Sam I am
⋮
Example:
Word Counting
Henry Chai - 9/1/22 22
I 2
⋮ ⋮
that 2
⋮ ⋮
I 2
⋮ ⋮
do 1
⋮ ⋮
I 2
⋮ ⋮
I 8
am 6
Sam 6
that 3
do 2
not ⋮
like
you 1
green ⋮
eggs
and 2
ham ⋮
them
23. – Given a text, count the number of times each word appears
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
do you like
green eggs and ham
I do not like them
Sam I am
⋮
MapReduce
(Apache
Hadoop)
Henry Chai - 9/1/22 23
I 2
⋮ ⋮
that 2
⋮ ⋮
I 2
⋮ ⋮
do 1
⋮ ⋮
I 2
⋮ ⋮
I 8
am 6
Sam 6
that 3
do 2
not ⋮
like
you 1
green ⋮
eggs
and 2
ham ⋮
them
Map Reduce
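The map and reduce phases sketched above can be mimicked in a single process: each "mapper" emits (word, 1) pairs for its chunk, a shuffle groups pairs by word, and each "reducer" sums the counts for one word. This is a toy illustration of the programming model, not the actual Hadoop machinery, which distributes these phases across machines:

```python
# Toy, single-process sketch of MapReduce word counting.
from collections import defaultdict

chunks = [
    "I am Sam Sam I am",
    "that Sam I am that Sam I am",
    "I do not like that Sam I am",
]

def mapper(chunk):
    # map phase: emit a (word, 1) pair for every word in the chunk
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped):
    # shuffle phase: group all emitted values by key (word)
    grouped = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped

def reducer(word, counts):
    # reduce phase: sum the partial counts for one word
    return word, sum(counts)

mapped = [mapper(c) for c in chunks]
result = dict(reducer(w, cs) for w, cs in shuffle(mapped).items())
print(result["Sam"])  # 5 for this excerpt
```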
24. Challenges in
Cloud
Computing
1. How can we divide work across multiple machines?
– Key consideration: moving data across
machines/over a network is costly!
2. What can we do if a machine fails?
– If a node fails every 3 years …
– … then a center with 10,000 nodes will have 10
faults/day!
– Even worse are stragglers, or nodes that run slower
than others
Henry Chai - 9/1/22 24
25. Machine
Failures
Henry Chai - 9/1/22 25
– Machines are plentiful, so just launch another job!
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
do you like
green eggs and ham
I do not like them
Sam I am
⋮
that 2
⋮ ⋮
I 2
⋮ ⋮
do 1
⋮ ⋮
I 2
⋮ ⋮
I 2
⋮ ⋮
26. Machine
Failures and
Stragglers
Henry Chai - 9/1/22 26
– Machines are plentiful, so just launch another job!
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
do you like
green eggs and ham
I do not like them
Sam I am
⋮
that 2
⋮ ⋮
I 2
⋮ ⋮
do 1
⋮ ⋮
I 2
⋮ ⋮
I 2
⋮ ⋮
27. Challenges in
Cloud
Computing
1. How can we divide work across multiple machines?
– Key consideration: moving data across
machines/over a network is costly!
2. What can we do if a machine fails?
– If a node fails every 3 years …
– … then a center with 10,000 nodes will have 10
faults/day!
– Even worse are stragglers, or nodes that run slower
than others
Henry Chai - 9/1/22 27
28. Communication
Hierarchy
Henry Chai - 9/1/22 28
(Figure: approximate bandwidths from the CPU outward — main memory (RAM) ≈ 50 GB/s, local disk ≈ 1 GB/s, in-rack nodes via the top-of-rack switch ≈ 1 GB/s, other racks ≈ 0.5 GB/s.)
29. What are the
implications for
MapReduce?
Henry Chai - 9/1/22 29
(Figure: the same bandwidth hierarchy — RAM ≈ 50 GB/s, disk ≈ 1 GB/s, in-rack nodes ≈ 1 GB/s, other racks ≈ 0.5 GB/s.)
30. MapReduce
Henry Chai - 9/1/22 30
I 8
am 6
Sam 6
that 3
do 2
not ⋮
like
you 1
green ⋮
eggs
and 2
ham ⋮
them
Reduce
I 2
⋮ ⋮
that 2
⋮ ⋮
I 2
⋮ ⋮
do 1
⋮ ⋮
I 2
⋮ ⋮
Map
32. – Iterative jobs involve lots of disk reading/writing per
iteration, which is slow!
– Issue: many machine learning/optimization algorithms are
inherently iterative!
MapReduce &
Iterative
Procedures
Machine
Learning
Henry Chai - 9/1/22 32
33. – Insight: the cost of RAM has been steadily decreasing
– Idea: shove more RAM into each machine and hold more
data in main memory → Apache Spark (circa 2010)
MapReduce &
Iterative
Procedures
Machine
Learning
Henry Chai - 9/1/22 33
(Figure: historical RAM and disk prices over time, falling below 1¢/MB. Source: https://jcmit.net/mem2015.htm)
34. Apache Spark
vs. Hadoop
Henry Chai - 9/1/22 34
Figure courtesy of Virginia Smith
Iteration and Big Data Processing
– Iteration in Hadoop: input (e.g., from HDFS) → iteration 1 → iteration 2 → iteration 3 → …, where each iteration is a MapReduce program that reads its input from the file system and writes its intermediate data back to the file system.
– Iteration in Spark: input (e.g., from HDFS) is read from the file system once; subsequent iterations are in-memory computations, with no need to read/write to disk.
35. Apache Spark
vs. Hadoop
Henry Chai - 9/1/22 35
Source: http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
36. Apache Spark:
3 Main APIs
1. Resilient distributed datasets (RDDs):
– A collection of elements partitioned across machines in
a fault-tolerant way that can be operated on in parallel
2. Dataframes:
– An abstraction on top of the RDD API
– Similar to tables in a relational database with named
and typed columns that support SQL-like queries
3. Datasets
– A combination of RDDs and Dataframes
– Not available in PySpark
Henry Chai - 9/1/22 36
37. Apache Spark:
3 Main APIs
1. Resilient distributed datasets (RDDs):
– A collection of elements partitioned across machines in
a fault-tolerant way that can be operated on in parallel
2. Dataframes:
– An abstraction on top of the RDD API
– Similar to tables in a relational database with named
and typed columns that support SQL-like queries
3. Datasets
– A combination of RDDs and Dataframes
– Not available in PySpark
Henry Chai - 9/1/22 37
38. RDDs
(Figure: an RDD of 10 elements split into 5 partitions, distributed across 3 machines.)
Henry Chai - 9/1/22 38
Rule of thumb: 2-4 partitions per CPU
39. Operations on
RDDs
– Transformations: return a new RDD
– Lazy – the new RDD is not
computed immediately
– Actions: compute a result from an RDD
and return/write that result to disk
– Eager – the result is computed
immediately
– Lazy vs. eager operations allows Spark to
efficiently manage data transfer
Henry Chai - 9/1/22 39
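The lazy/eager split can be mimicked in plain Python with generators: building the pipeline does no work, and nothing is computed until a result is demanded. This is only an analogy for the semantics, not the Spark API itself:

```python
# Generator pipelines are lazy, like Spark transformations;
# consuming them (e.g., with sum) is eager, like Spark actions.
log = []

def traced(x):
    log.append(x)        # record when work actually happens
    return x * 2

pipeline = (traced(x) for x in range(5))  # "transformation": lazy, nothing runs yet
assert log == []                          # no element has been processed

result = sum(pipeline)                    # "action": forces the computation
print(result)  # 20
print(log)     # [0, 1, 2, 3, 4]
```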
40. Transformations
Henry Chai - 9/1/22 40
– Transformations: return a new RDD
– Lazy – the new RDD is not
computed immediately
– Sequence of transformations defines a
“recipe” for computing some result
– Allows Spark to efficiently recover
from failures/stragglers by recalling
the steps required to bring a new
machine up to speed
41. Transformations:
Examples
Henry Chai - 9/1/22 41
Transformation Description
map(func) returns a new RDD by applying the function func
to each element of the source RDD
flatMap(func) similar to map except each element of the source
RDD can be mapped to 0 or more outputs
filter(func) returns a new RDD consisting of the elements
where func evaluates to TRUE
distinct() returns a new RDD consisting of the unique
elements of the source RDD
union(otherRDD) returns a new RDD consisting of the union of
elements in the source RDD and otherRDD
intersection
(otherRDD)
returns a new RDD consisting of the intersection
of elements in the source RDD and otherRDD
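The semantics of the transformations in the table can be sketched with pure-Python analogues on ordinary lists (a sketch of what each one computes, not the Spark API; note that Spark's union keeps duplicates while intersection removes them):

```python
# Pure-Python analogues of the listed RDD transformations, on plain lists.
source = [1, 2, 2, 3, 4]
other = [3, 4, 5]

mapped = [x * 10 for x in source]                 # map(func): one output per element
flat = [y for x in source for y in range(x)]      # flatMap(func): 0+ outputs per element
filtered = [x for x in source if x % 2 == 0]      # filter(func): keep where func is TRUE
distinct = list(set(source))                      # distinct(): unique elements
union = source + other                            # union(otherRDD): keeps duplicates
intersection = [x for x in set(source) if x in other]  # intersection(otherRDD)

print(filtered)  # [2, 2, 4]
```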
42. Actions
Henry Chai - 9/1/22 42
– Transformations: return a new RDD
– Lazy – the new RDD is not
computed immediately
– Actions: compute a result from an RDD
and return/write that result to disk
– Eager – the result is computed
immediately
– Required to get Spark to “do” anything,
i.e., generate the output we want
43. Actions:
Examples
Henry Chai - 9/1/22 43
Action Description
reduce(func) aggregates the elements of the source RDD using
the function func, a commutative and associative
function that takes two arguments and returns
one (allowing for parallelization)
collect() returns all elements of the source RDD as an array
to the driver program
take(n) returns the first n elements of the source RDD as
an array to the driver program
first() returns the first element of the source RDD to the
driver program
count() returns the number of elements in the source RDD
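The actions in the table have similarly direct single-machine analogues; note in particular that reduce's function must be commutative and associative so per-partition results can be combined in any order (again a sketch of the semantics, not the Spark API):

```python
# Pure-Python analogues of the listed RDD actions, on a plain list.
from functools import reduce

source = [3, 1, 4, 1, 5]

total = reduce(lambda a, b: a + b, source)  # reduce(func): func must be
                                            # commutative and associative
collected = list(source)                    # collect(): all elements
first_three = source[:3]                    # take(3): first n elements
first = source[0]                           # first(): first element
count = len(source)                         # count(): number of elements

print(total)  # 14
```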
44. For A Detailed
Walkthrough…
– Recitation 1 and HW1 will walk you step-by-step through…
– The architecture of a typical Spark job
– Lambda functions
– RDDs and caching
– Dataframes and schema
– Getting set up in Databricks
Henry Chai - 9/1/22 44
45. Key Takeaways
– MapReduce as a framework for distributed processing
– Heavy disk I/O → poorly suited for iterative procedures
– Apache Spark
– Big idea: keep data in memory as much as possible
– RDDs are the foundational API
– Transformations (lazy) vs actions (eager)
– Laziness/eagerness allows Spark to optimize the
execution of operations
Henry Chai - 9/1/22 45
46. Homework 1
Programming
Preview
Henry Chai - 9/1/22 46
– Part 1: Entity Resolution with RDDs
– Identifying and linking different occurrences of the
same object across multiple data sources
– Different titles/names for the same person
– Different pictures of the same physical object
– Motivating example: product listings on different
e-commerce websites
– Google shopping: clickart 950000 - premier
image pack (dvd-rom) massive collection of
images & fonts for all your design needs …
– Amazon: clickart 950 000 - premier image pack
(dvd-rom)
47. Homework 1
Programming
Preview
– Part 1: Entity Resolution with RDDs
– Determine if two product listings correspond to the
same product using text similarity
– Distance between texts defined as the cosine
similarity of their TF-IDF representations:
– Similar to bag-of-words model except instead of
occurrences, each word is defined by its term
frequency times its inverse document frequency
– TF(word, doc) = (# of times word appears in doc) / (total # of words in doc)
– IDF(word) = (total # of documents) / (# of documents that contain word)
– Upweights words that occur a lot in a few specific
documents and rarely in other documents
Henry Chai - 9/1/22 47
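The TF-IDF and cosine-similarity definitions above can be sketched in plain Python on toy listings (the listing strings below are made up to stand in for the HW1 data; note the slide's IDF is a raw ratio, while practice often log-scales it):

```python
# Toy TF-IDF + cosine similarity, following the slide's definitions.
import math

docs = {
    "google": "clickart 950000 premier image pack massive collection of images",
    "amazon": "clickart 950000 premier image pack",
    "other":  "power plant sensor readings",
}
tokenized = {name: text.split() for name, text in docs.items()}
all_docs = list(tokenized.values())
vocab = sorted({w for words in all_docs for w in words})

def tf(word, doc_words):
    # # of times word appears in doc / total # of words in doc
    return doc_words.count(word) / len(doc_words)

def idf(word):
    # total # of documents / # of documents that contain word
    containing = sum(1 for words in all_docs if word in words)
    return len(all_docs) / containing

def tfidf_vector(doc_words):
    return [tf(w, doc_words) * idf(w) if w in doc_words else 0.0 for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

g, a, o = (tfidf_vector(tokenized[k]) for k in ("google", "amazon", "other"))
print(cosine(g, a) > cosine(g, o))  # True: matching listings score higher
```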
48. Homework 1
Programming
Preview
– Part 2: Machine Learning Pipeline with DataFrames
– Task: predicting power generation of power plants
under different environmental conditions
1. Data ETL (extract-transform-load): converting raw
data into a workable form
2. Exploration and visualization
3. Modelling
4. Hyperparameter optimization and model selection
Henry Chai - 9/1/22 48
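The four pipeline stages can be sketched end-to-end on made-up numbers with only the standard library (the actual assignment uses Spark DataFrames and a real power-plant dataset; the values and the one-feature model below are purely illustrative):

```python
# Minimal stdlib-only sketch of the four pipeline stages on made-up data.
import csv
import io
import statistics

raw = "temp,power\n10,480\n20,460\n30,440\n25,450\n"

# 1. Data ETL: parse raw text into typed records
rows = [(float(r["temp"]), float(r["power"]))
        for r in csv.DictReader(io.StringIO(raw))]

# 2. Exploration: simple summary statistics
temps, powers = zip(*rows)
mean_t, mean_p = statistics.mean(temps), statistics.mean(powers)

# 3. Modelling: one-feature least-squares fit, power ≈ a + b * temp
b = (sum((t - mean_t) * (p - mean_p) for t, p in rows)
     / sum((t - mean_t) ** 2 for t in temps))
a = mean_p - b * mean_t

# 4. Hyperparameter optimization/model selection would compare such
#    fits under different settings; here we just predict once
prediction = a + b * 25.0
print(round(prediction, 1))  # 450.0
```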