MAPREDUCE IMPLEMENTATION FOR MALICIOUS WEBSITES CLASSIFICATION - IJNSA Journal
Due to the rapid growth of the internet, malicious websites [1] have become the cornerstone of internet crime activities. There are many existing approaches to detecting benign and malicious websites, some achieving nearly 99% accuracy. Detection of malicious websites is now reasonably accurate, but in terms of processing speed it remains an enormous and costly task because of the number and complexity of the sites involved. In this project, we implement a classifier that detects benign and malicious websites using the network and application features available in a dataset from Kaggle, and we use MapReduce to make classification faster than traditional approaches [2].
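The abstract does not include code, but the MapReduce framing can be illustrated with a minimal, self-contained sketch. Everything below is hypothetical: a toy feature name and a stand-in rule replace the paper's trained classifier, and the single-machine run() only simulates the shuffle between map and reduce phases.

```python
# Hypothetical sketch of MapReduce-style classification; the paper's real
# features and trained model (from the Kaggle dataset) are replaced by a
# toy rule so the example is self-contained.
from collections import defaultdict

def mapper(record):
    """Map phase: score one website record, emit (predicted_label, 1)."""
    url, features = record
    # Toy stand-in for a trained classifier over network/application features.
    label = "malicious" if features.get("special_chars", 0) > 10 else "benign"
    yield (label, 1)

def reducer(label, counts):
    """Reduce phase: aggregate counts per predicted label."""
    return label, sum(counts)

def run(records):
    """Simulate the shuffle between map and reduce on a single machine."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

if __name__ == "__main__":
    data = [("a.com", {"special_chars": 3}), ("b.biz", {"special_chars": 42})]
    print(run(data))  # {'benign': 1, 'malicious': 1}
```

In a real deployment the mapper would run in parallel across input splits of URL records, which is where the claimed speedup over single-machine classification comes from.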
Paradigm4 Research Report: Leaving Data on the Table - Paradigm4
While Big Data enjoys widespread media coverage, not enough attention has been paid to what practitioners think — data scientists who manage and analyze massive volumes of data. We wanted to know, so Paradigm4 teamed up with Innovation Enterprise to ask over 100 data scientists for their help separating Big Data hype from reality. What we learned is that data scientists face multiple challenges achieving their company’s analytical aspirations. The upshot is that businesses are leaving data — and money — on the table.
You've heard the news: Data Science is the cool new career opportunity sweeping the world. Come learn from Thinkful Mentors all about this new and exciting industry.
Join our #DataTalk on Thursdays at 5 p.m. ET. This week, we tweeted with Dr. Michael Wu, the Chief Scientist at Lithium, where he applies data-driven methodologies to investigate the complex dynamics of the social web.
Michael works with big data and has developed many predictive and prescriptive social analytics with actionable insights. His R&D earned him recognition as a 2010 Influential Leader by CRM Magazine.
You can see all tweets and resources here:
http://www.experian.com/blogs/news/about/data-scientists/
BIG Data & Hadoop Applications in Social Media - Skillspeed
Explore the applications of BIG Data & Hadoop in Social Media via Skillspeed.
BIG Data & Hadoop in Social Media is a key differentiator, especially in terms of generating memorable customer experiences.
Herein, we discuss how leading social networks such as Facebook, Twitter, Pinterest, LinkedIn, Instagram and StumbleUpon utilize Hadoop.
To get more details regarding BIG Data & Hadoop, please visit - www.SkillSpeed.com
Pavan Kapanipathi's talk at IBM's Frontiers of Cloud Computing and Big Data Workshop 2014. http://researcher.ibm.com/researcher/view_group_subpage.php?id=5565
Due to the increased adoption of the social web, users, specifically Twitter users, are facing information overload. Unless a user is willing to restrict the sources (e.g., the number of accounts followed), important information relevant to the user's interests often goes unnoticed. The reasons include: (1) the postings may arrive at a time when the user is not looking; (2) the user may be unaware of, and hence not following, the information source; and (3) the information may arrive at a rate faster than the user can consume it. Furthermore, information that is temporally relevant is of little use if discovered late.
My research addresses these challenges by:
(1) Generating user profiles of interests from Twitter using Wikipedia. The interests gleaned from users' Twitter data can be leveraged by personalization and recommendation systems to reduce information overload (volume) for users.
(2) Filtering Twitter data relevant to dynamically evolving entities. Together with volume, this addresses the velocity challenge of delivering relevant information in real time. The approach is deployed on Twitris to crawl dynamic event-relevant tweets for analysis. The prominent aspect of both approaches is the use of crowd-sourced knowledge bases such as Wikipedia.
We present solutions for making cyberspace secure through feature-rich, robust, yet lean machine-learning-based algorithms that help organizations identify malicious actors, intruders, and illegal system access by studying features that arise purely from system login behavior.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
DISUMMIT Keynote presentation from Kirk Borne - From Sensors to Sense-Making - DigitYser
Dr. Kirk Borne is a Principal Data Scientist at Booz Allen Hamilton. With a rich background in astrophysics and computational science, he was a pioneer in bringing big data courses into academia. He is one of the most important promoters of data literacy in the world.
About Kirk and his view on data literacy and evolution
On his first visit to Brussels, Kirk’s first activity was sharing his best practices for promoting data literacy. While enjoying a magnificent view of Brussels from the ING headquarters building, Kirk playfully (with a pair of socks!) explained how subjectivity plays a major role in the way that data is understood, shaped by the wide variety of perspectives involved. This keynote was delivered at the speakers’ reception, which took place the day before the DI Summit.
The following day, Kirk wrapped up the DI Summit with his closing keynote on how data has shifted into something that is sense-making, following the evolution from “data” to “big data” to “smart data”, which is composed of both enriched and semantic data and is essential for IoT. He also discussed the levels of maturity in a self-driving enterprise, wrapping up his participation by sharing this equation:
Big Data + IoT + Citizen Data Scientists = Partners in Sustainability
Kirk’s impression of the DI Summit was that it was a fun and informative event. His favorite format was the 5” pitches, as they were properly structured, providing the most critical information to the attendees. He also thinks that the networking dynamic ensured that all attendees met interesting people.
A takeaway from Kirk’s presentation
“Big data is not about how big it is, but the value you extract from it”
We look forward to having Kirk back in Brussels sometime soon!
Here are some video interviews that I have done:
https://www.youtube.com/watch?v=ku2na1mLZZ8
https://www.youtube.com/watch?v=iXjvht91nFk
Here is my TedX talk: https://www.youtube.com/watch?v=Zr02fMBfuRA
Ofer Ron, senior data scientist at LivePerson.
Recently, I've had the pleasure of presenting an introduction to Data Science and data-driven products at DevconTLV.
I focused this talk on the basic ideas of data science, not the technology used, since I think that far too often companies and developers rush to play around with "big data"-related technologies instead of figuring out what questions they want to answer and whether those answers form a successful product.
CGIAR Collaborative Platform for Gender Research - Gender meets big data - CGIAR
This presentation was given by Marcelo Tyszler (CGIAR Collaborative Platform for Gender Research / KIT Royal Tropical Institute), as part of the Annual Scientific Conference hosted by the University of Canberra and co-sponsored by the University of Canberra, the Australian Centre for International Agricultural Research (ACIAR) and CGIAR Collaborative Platform for Gender Research. The event took place on April 2-4, 2019 in Canberra, Australia.
Read more: https://www.canberra.edu.au/research/faculty-research-centres/aisc/seeds-of-change and https://gender.cgiar.org/annual-conference-2019/
The presentation is about the career path in the field of Data Science. Data Science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
How I Learned to Stop Worrying and Love Linked Data - Domino Data Lab
In this presentation, Jon Loyens will share:
-Best practices for sharing context and knowledge about your data projects
-How linked data can augment your existing data science workflow and toolchain to accelerate your work
-How a social network can unlock the power of Linked Data and data collaboration
-How Linked Data can help you easily combine private and Open Data for fun and profit
Fundamentals of Big Data in 2 minutes!! - Simplify360
In today’s world, where information is increasing every second, big data plays a major role in transforming any business.
Learn the fundamentals of big data in just 2 minutes!
Personalized Search at Sandia National Labs - Lucidworks
Clay Pryor, R&D S&E, Computer Science & Ryan Cooper, Sandia National Labs. Presentation from ACTIVATE 2019, the Search and AI Conference hosted by Lucidworks. http://www.activate-conf.com
Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
Why Data Mining?
What Is Data Mining?
Data Mining: On What Kind of Data?
Data Classification
What is Sentiment Classification?
Importance of Sentiment classification
Twitter for Sentiment Classification
Problem Statement
Goal of this Classification
Method to be used
Conclusion
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at... - Yael Garten
2017 StrataHadoop SJC conference talk. https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56047
Description:
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #DataScienceHappiness.
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha... - Shirshanka Das
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #datasciencehappiness.
These practice guidelines are for those who manage big data and big data analytics projects, or who are responsible for the use of data analytics solutions. They are also intended for business leaders and program leaders who are responsible for developing agency capability in big data and big data analytics.
For those agencies currently not using big data or big data analytics, this document may assist strategic planners, business teams and data analysts to consider the value of big data to the current and future programs.
This document is also of relevance to those in industry, research and academia who can work as partners with government on big data analytics projects.
Technical APS personnel who manage big data and/or do big data analytics are invited to join the Data Analytics Centre of Excellence Community of Practice to share information on technical aspects of big data and big data analytics, including achieving best practice with modeling and related requirements. To join the community, send an email to the Data Analytics Centre of Excellence.
Agile Data Science is a lean methodology adapted from Agile Software Development. At its core, it centers on people, interactions, and building minimum viable products that ship fast and often to solicit customer feedback. In this presentation, I describe how this work was done in the past, with examples. Get started today with our help by visiting http://www.alpinenow.com
DevOps Support for an Ethical Software Development Life Cycle (SDLC) - Mark Underwood
As part of the IEEE SA P7000 and P2675 working groups, it has been determined that DevOps engineering practices can support (or hinder) the environment for an ethical software development life cycle (SDLC). This deck scratches the surface.
Exploring the impact and evolution of Advanced Analytics Tools.pdf - Stats Statswork
The impact and evolution of advanced analytics tools have transformed how businesses operate, offering unprecedented insights and decision-making capabilities. Statswork has been at the forefront of this evolution, providing cutting-edge solutions that leverage big data, machine learning, and AI. These tools enable companies to analyze vast amounts of data in real time, identify trends, and predict future outcomes with high accuracy. As a result, businesses can optimize their operations, enhance customer experiences, and drive innovation. The continuous advancement of these tools promises even greater efficiencies and opportunities, making them indispensable in the modern data-driven landscape.
For more information contact:
https://www.statswork.com
& https://www.statswork.com/contact-us/
Contact our Experts:
Our Email id: info@statswork.com
Contact No: +91 8754467066
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2023/09/responsible-ai-tools-and-frameworks-for-developing-ai-solutions-a-presentation-from-intel/
Mrinal Karvir, Senior Cloud Software Engineering Manager at Intel, presents the “Responsible AI: Tools and Frameworks for Developing AI Solutions” tutorial at the May 2023 Embedded Vision Summit.
Over 90% of businesses using AI say trustworthy and explainable AI is critical to business, according to Morning Consult’s IBM Global AI Adoption Index 2021. If not designed with responsible considerations of fairness, transparency, preserving privacy, safety and security, AI systems can cause significant harm to people and society and result in financial and reputational damage for companies.
How can we take a human-centric approach to design AI solutions? How can we identify different types of bias and what tools can we use to mitigate those? What are model cards, and how can we use them to improve transparency? What tools can we use to preserve privacy and improve security? In this talk, Karvir discusses practical approaches to adoption of responsible AI principles. She highlights relevant tools and frameworks and explores industry case studies. She also discusses building a well-defined response plan to help address an AI incident efficiently.
Practical Applications for Social Network Analysis in Public Sector Marketing... - Mike Kujawski
Over the past decade there has been a growing public fascination with the complex connectedness of modern society. This has been driven in large part by the wide availability of public digital data produced through our daily interactions on the modern social web. This data can now easily be mined and analyzed to produce valuable and actionable business insights, leading to better decision making in nearly every field of practice, especially marketing and communications. In this presentation, Joshua Gillmore and Mike Kujawski introduce the basics of social network analysis and some of the privacy-related challenges that this rapidly growing space brings with it. The focus of this deck is on public sector organizations.
By: @mikekujawski and @joshuagillmore
PAARL's 1st Marina G. Dayrit Lecture Series held at UP's Melchor Hall, 5F, Proctor & Gamble Audiovisual Hall, College of Engineering, on 3 March 2017, with Albert Anthony D. Gavino of Smart Communications Inc. as resource speaker on the topic "Using Big Data to Enhance Library Services"
Learn SQL from basic queries to advanced queries - manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf - Enterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake - Walaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
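As a rough illustration of the idea only (not LinkedIn's actual implementation; the table name, annotation format, and masking policy below are invented), one can imagine compliance views being auto-generated from declarative column annotations along these lines:

```python
# Hypothetical sketch: generate a compliance-enforcing SQL view from
# declarative column annotations (table, column, and policy names invented).
ANNOTATIONS = {
    "member_profiles": {
        "member_id": "allow",
        "email": "mask",        # PII: redacted in the compliant view
        "country": "allow",
        "birth_date": "mask",
    }
}

def compliance_view(table: str) -> str:
    """Emit a SELECT that passes allowed columns through and NULLs out PII."""
    cols = []
    for col, policy in ANNOTATIONS[table].items():
        if policy == "mask":
            cols.append(f"CAST(NULL AS VARCHAR) AS {col}")  # redacted column
        else:
            cols.append(col)
    return (f"CREATE VIEW {table}_compliant AS\n"
            f"SELECT {', '.join(cols)}\nFROM {table}")

print(compliance_view("member_profiles"))
```

Because the view keeps the base table's schema, a catalog can transparently route table resolutions to it, which is the portability property the slides describe.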
Global Situational Awareness of A.I. and where it's headed - vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf - GetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
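For readers new to the pattern behind the talk, here is a minimal sketch of the core RAG loop: retrieve relevant context, augment the prompt, then call the model. Keyword overlap stands in for embeddings plus a vector database, and generate() is a stub for the LLM call; all names and documents are invented.

```python
# Toy RAG sketch: a keyword-overlap retriever stands in for a vector
# database, and generate() is a stub for the real LLM call.
DOCS = [
    "Table fct_orders holds one row per order, with revenue in cents.",
    "Table dim_customers maps customer_id to region and signup date.",
]

def retrieve(question: str, k: int = 1) -> list:
    """Rank documents by word overlap with the question (toy retriever)."""
    words = set(question.lower().split())
    ranked = sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))
    return ranked[:k]

def generate(prompt: str) -> str:
    return f"[LLM response for a prompt of {len(prompt)} chars]"  # stub

def answer(question: str) -> str:
    """Augment the prompt with retrieved context before generation."""
    context = "\n".join(retrieve(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nSQL:"
    return generate(prompt)

print(answer("Which table has revenue per order?"))
```

A production data copilot replaces the retriever with embeddings over curated documentation of the company's data assets, which is what lets the LLM write correct SQL against internal tables.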
3. DataKind by the Numbers
- 20,000+ community members worldwide in 98 countries, representing the largest global data science for social good network
- 5 global chapters
- 250+ events around the world
- Volunteer sign-ups from 174 countries
- 300+ projects completed, providing the most comprehensive library of data science for social good projects
- 150+ organizations helped
- 200,000+ hours donated
- $35M+ pro bono services delivered
From evening or weekend events to multi-month projects, our programs are designed to provide social organizations with the pro bono data science innovation team they need to tackle critical humanitarian issues.
5. DataKind's strategy is grounded in four key principles:
- Catalyze a thriving Data Science for Good ecosystem through partnerships: A healthy ecosystem requires forming the right "we" to deliver on a range of data science needs. While we will continue to deliver data science projects, we will also use partnerships wherever possible to create an accessible ecosystem of data science resources.
- Be the connective tissue between the social sector and private sector data science resources: We will elevate attention to the Data Science for Good field, with the goal of building nonprofit demand, data science talent/resources, and philanthropic investment for all.
- Identify the brightest opportunities for data science: We are known for our ability to scope how to apply data science to social sector organizations, ensuring that data science solutions are designed thoughtfully, implemented ethically, and used effectively.
- Build data science projects that advance the field: DataKind will work directly with nonprofits in targeted issue areas where there are unmet data science needs that stretch beyond the individual organization, and with a coalition of interested and committed partners.
6. How might data science help nonprofits?
- Expand impact by anticipating future needs
- Scale services by providing personalized support
- Save staff time by automating processes
- Better understand the communities served
- Better target efforts and find those in need of services
- Use open/external data sources to inform decision making
7. DataCorps Process
1. Problem Exploration: We explore what's possible, then staff an expert volunteer team.
2. Data Discovery: The team wrangles the data and identifies external data sources to leverage.
3. Prototyping: The team co-creates solutions with the partner, while DataKind oversees their work.
4. Refinement: Based on feedback, the team makes adjustments to meet the partner's needs.
5. Solution: The team delivers the final version and documentation so the partner can increase its impact.
8. About Hello Sunday Morning
HSM is an Australian-based non-profit focused on helping people form healthier relationships with alcohol. It created the Daybreak app, a professional and community support social network.
9. Daybreak
- Members select a mood they're feeling.
- Share how they're feeling by making a post.
- Comment, like, and save other members' posts.
- Set goals and reminders.
11. Challenge
Moderators read every post and flag those that are potentially problematic: either posts that indicate potentially harmful behavior or posts in breach of community guidelines. Moderators will either provide support or escalate members to a clinical team.
HSM is facing the problem of growing membership: the moderators' task is becoming unmanageable, with hundreds of thousands of community activities (posts, comments, reactions) to review and flag if necessary.
12. Ask
Moderators need assistance from an automated approach: an efficient and scalable solution to flag and categorize risky or breach activity.
14. Data Provided
HSM provided historical (Jan-Sept 2019), labeled post data containing raw text (with PII removed), the timestamp of each post, and a risk/breach category.
There was a large amount of data but significant class imbalance (< 0.1% of the posts were risky/breach).
15. Objective 1: Identify Risky Posts
A model was built to predict the probability of a post being risky.
Steps:
1) Remove weekend posts from the dataset.
2) Calculate lexicon-based sentiment score.
3) Clean text data.
4) Tokenize posts.
5) Create more features.
6) Train model.
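The deck does not show code for these steps; as a hedged sketch, steps 3-6 could be approximated with a scikit-learn pipeline like the one below (the team's actual features, sentiment lexicon, and model are not specified in the slides, and the toy data is invented):

```python
# Hypothetical sketch of the risky-post classifier (scikit-learn is a
# stand-in; the slides do not name the actual model or feature set).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = ["I feel great today", "I am in a very bad place"]   # toy data
labels = [0, 1]                                              # 1 = risky

# TfidfVectorizer covers cleaning and tokenization (steps 3-4);
# class_weight="balanced" compensates for the <0.1% positive rate
# mentioned on the "Data Provided" slide.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(class_weight="balanced"),
)
model.fit(posts, labels)

# Probability of the "risky" class for a new post.
print(model.predict_proba(["drank too much last night"])[:, 1])
```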
16. Assessment of Model
The model was tested on a sample of post data unseen by the model (Nov 2019 - Jan 2020).
Table 1: Model performance on test data at varying probability thresholds

| Metric    | Threshold = 0.1 | Threshold = 0.3 | Threshold = 0.5 | Threshold = 0.7 |
|-----------|-----------------|-----------------|-----------------|-----------------|
| Recall    | 0.8             | 0.5             | 0.3             | 0.2             |
| Precision | 0.8             | 0.9             | 0.9             | 0.9             |
| F1 score  | 0.8             | 0.7             | 0.5             | 0.4             |

HSM is looking to use the threshold of 0.1 to minimize the number of false negatives.
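To make the trade-off in Table 1 concrete, here is a small sketch (with invented labels and scores, not HSM's data) of how the metrics are recomputed as the threshold varies; lowering the threshold raises recall at the cost of precision, which is why HSM favors 0.1:

```python
# Toy illustration of Table 1: sweep the probability threshold and
# recompute the metrics (labels and scores invented, not HSM's data).
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])          # 1 = risky post
y_prob = np.array([0.95, 0.2, 0.15, 0.6, 0.05, 0.4, 0.8, 0.3])

for t in (0.1, 0.3, 0.5, 0.7):
    y_pred = (y_prob >= t).astype(int)   # flag posts scoring above t
    print(f"threshold={t}: "
          f"recall={recall_score(y_true, y_pred):.2f} "
          f"precision={precision_score(y_true, y_pred):.2f} "
          f"f1={f1_score(y_true, y_pred):.2f}")
```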
17. Objective 1: Identify Risky Posts
A keyword detector was built to indicate potentially risky words/phrases in a post, covering the following categories:
- Suicide
- Domestic Violence
- DUI
- Risky Behavior
- Detox Withdrawal
- Mental Health
- Self Harm
- Other
18. Objective 2: Identify Breach Posts
Pre-trained models were used to detect posts with PII or profanity.
Steps:
1) Detecting PII by utilizing pre-trained/off-the-shelf models for named entity recognition and regex-based detection. Detects text related to people, organizations, locations, dates, times, email addresses, phone numbers, and street addresses.
2) Detecting profanity by using an off-the-shelf regex-based model.
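A toy sketch of the regex side of this step follows; the patterns below are invented for illustration, and the named-entity side (people, organizations, locations) would come from an off-the-shelf NER model and is omitted here:

```python
# Toy regex-based PII detection (patterns invented for illustration; the
# team also used pre-trained NER models, not shown here).
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    # Hypothetical pattern for an Australian mobile number.
    "phone": re.compile(r"\b(?:\+?61|0)4\d{2}[ -]?\d{3}[ -]?\d{3}\b"),
    "time": re.compile(r"\b\d{1,2}:\d{2}\s?(?:am|pm)?\b", re.IGNORECASE),
}

def detect_pii(text: str) -> dict:
    """Return only the PII categories that matched, with the matched text."""
    hits = {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}

print(detect_pii("Call me on 0412 345 678 or mail jo@example.com at 7:30pm"))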
20. Deploying in Production
A REST API was built in Flask to enable usage of the solutions created. The API handles:
- Pre-processing data
- Feature engineering
- Generating model predictions/outputs
[Diagram: HSM (Daybreak) sends a request to the API and receives a response.]
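A minimal sketch of what the Flask service for the first endpoint could look like (the route name is hypothetical and the model is stubbed; the request and response fields mirror the example slides that follow):

```python
# Minimal Flask sketch of the risk-probability endpoint (hypothetical
# route name; the trained model is replaced by a stub).
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_risk(text: str) -> float:
    return 0.01  # stand-in for the trained classifier's probability

@app.route("/risk", methods=["POST"])  # route name is an assumption
def risk():
    post = request.get_json()          # {"share_content": ..., "created_at": ...}
    prob = predict_risk(post["share_content"])
    return jsonify({"Prediction Risk": prob})

if __name__ == "__main__":
    app.run()
```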
21. Deploying in Production
The API has three endpoints that HSM can utilize. The outputs of the endpoints are:
1. The probability a post is risky, on a scale of 0-1.
2. Risk keywords in the post.
3. PII categories and words in the post.
22. Examples of API Output: Endpoint #1
Probability Risk: the probability a post is risky, on a scale of 0-1.
Request:
{
  "share_content": "I feel like things are starting to turn around for me.",
  "created_at": "2020-01-01 09:17:42"
}
Response:
{"Prediction Risk": 0.01}
23. Examples of API Output: Endpoint #2
Risk Keywords: risk keywords in the post.
Request:
{
  "share_content": "I drank too much last night and am now in a bad place.",
  "created_at": "2019-11-24 08:18:31"
}
Response:
{
  "DUI": ["drank too much"],
  "Mental Health": ["bad place"]
}
24. Examples of API Output: Endpoint #3
PII/Profanity: PII breaches in the post.
Request:
{
  "share_content": "Hey guys my name is John, anyone in Sydney want to meet up at Hyde Park this Saturday at noon?",
  "created_at": "2019-10-18 02:11:21"
}
Response:
{
  "People": ["John"],
  "Locations": ["Sydney", "Hyde Park"],
  "Dates": ["this Saturday"],
  "Times": ["noon"]
}