Open Source Big Graph Analytics on Neo4j with Apache Spark - Kenny Bastani
In this talk I will introduce you to a Docker container that provides an easy way to do distributed graph processing using Apache Spark GraphX and a Neo4j graph database. You’ll learn how to analyze big data graphs that are exported from Neo4j and consequently updated from the results of a Spark GraphX analysis. The types of analysis I will be talking about are PageRank, connected components, triangle counting, and community detection.
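As a rough illustration of the first of those analyses, the sketch below runs the same fixed-point iteration that a PageRank job computes, in plain Python over a toy adjacency list. The graph, damping factor, and iteration count are illustrative assumptions, not taken from the talk or from GraphX's API:

```python
# Toy PageRank: repeatedly redistribute each node's rank to its out-neighbors,
# damped toward a uniform baseline, until the ranks stabilize.
def pagerank(graph, damping=0.85, iterations=50):
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for node, neighbors in graph.items():
            if neighbors:
                share = damping * rank[node] / len(neighbors)
                for m in neighbors:
                    new_rank[m] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for m in nodes:
                    new_rank[m] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank

# "c" is linked to by both "a" and "b", so it ends up with the highest rank.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In GraphX the same computation is a one-liner on a `Graph` object; the point of the sketch is only what the algorithm does with the edges.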
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro... - Databricks
The Semantic Engine is a custom search engine deployable on top of large, non-native language corpora that goes beyond keyword search and does NOT require translation. The large, on-the-fly calculations essential to making this an effective search engine necessitated development on a distributed platform capable of processing large volumes of unstructured data.
Hear how the low barrier to entry provided by Apache Spark allowed the Novetta Solutions team to focus on the hard analytical challenges presented by their data, without having to spend much time grappling with the inherent difficulties normally associated with distributed computing.
Open Data Science Conference Agile Data - DataKitchen
To rephrase an old saying: ‘It takes a village to raise an Analyst.’ Data Analysts and Scientists are working in teams delivering insight and analysis on an ongoing basis. So how do you get the team to support experimentation and insight delivery without ending up in an IT Engineer vs Analyst vs Data Governance war? We present 5 shocking steps to get these teams of people working together with practical, doable steps that can help you achieve data agility.
Consolidating MLOps at One of Europe’s Biggest Airports - Databricks
At Schiphol airport we run a lot of mission critical machine learning models in production, ranging from models that predict passenger flow to computer vision models that analyze what is happening around the aircraft. Especially now in times of Covid it is paramount for us to be able to quickly iterate on these models by implementing new features, retraining them to match the new dynamics and above all to monitor them actively to see if they still fit the current state of affairs.
To meet those needs we rely on MLflow, but we have also integrated it with many of our other systems: we have written Airflow operators for MLflow to ease the retraining of our models, integrated MLflow deeply with our CI pipelines, and connected it to our model monitoring tooling.
In this talk we will take you through the way we rely on MLFlow and how that enables us to release (sometimes) multiple versions of a model per week in a controlled fashion. With this set-up we are achieving the same benefits and speed as you have with a traditional software CI pipeline.
This talk will focus on techniques, metrics and different tests (code, models, infra and features/data) that help developers of machine learning systems achieve CD.
Reproducible data science: review of Pachyderm, Data Version Control and GIT ... - Josh Levy-Kramer
The advances in machine learning are great, yet, in order to have real value within a company, data scientists must be able to go from a research project to a reproducible process. A common problem is that the code is intrinsically linked to the data it was developed against. Hence it is critically important to track, trace and validate the input data used to train and test the algorithm. This talk will review several tools for data versioning and processing.
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da... - Databricks
PixieDust is a new open source library that helps data scientists and developers working in Jupyter Notebooks and Apache Spark be more efficient. PixieDust speeds up data manipulation and display with features like: auto-visualization of Spark DataFrames, real-time Spark job progress monitoring, automated local install of Python and Scala kernels running with Spark, and much more.
Come along and learn how you can use this tool in your own projects to visualize and explore data effortlessly with no coding. Oh, and if you prefer working with a Scala Notebook, this session is also for you, as PixieDust can also run on a Scala Kernel. Imagine being able to visualize your favorite Python chart engines from a Scala Notebook!
We’ll finish the session with a demo combining Twitter, Watson Tone Analyzer, Spark Streaming, and some fun real-time visualizations–all running within a Notebook.
Talk on Data Discovery and Metadata by Mark Grover from July 2019.
Goes into detail of the problem, build/buy/adopt analysis and Lyft's solution - Amundsen, along with thoughts on the future.
Warehousing Your Hits - The Why and How of Owning Your Data - Scott Arbeitman
These are the slides from my recent presentation at Melbourne's Web Analytics Wednesdays. I talk about transitioning from collecting your data in primary digital analytics systems to storing it in a data warehouse or data lake.
Presented by David Smith at The Data Science Summit, Chicago, April 20 2017.
The ability to independently reproduce results is a critical issue within the scientific community today, and is equally important for collaboration and compliance in business. In this talk, I'll introduce several features available in R that help you make reproducibility a standard part of your data science workflow. The talk will include tips on working with data and files, combining code and output, and managing R's changing package ecosystem.
Amundsen: From discovering data to securing data - markgrover
Hear about how Lyft and Square are solving data discovery and data security challenges using a shared open source project - Amundsen.
Talk details and abstract:
https://www.datacouncil.ai/talks/amundsen-from-discovering-data-to-securing-data
Delivered by Josh Katz (Graphics Editor, The New York Times) at the 2016 New York R Conference on April 8th and 9th at Work-Bench. See the rest of the conference videos & presentations at http://www.rstats.nyc.
Many of us data science and business analytics practitioners perform research and analysis for decision makers on a regular basis. The deliverable of such analysis often results in a Power Point presentation, and/or a model that needs to be productionalized. The code used to produce the analysis also needs to be considered a deliverable.
Many of us perform analysis without reproducibility in mind. With the increasing democratization of data, it is becoming more and more important for people that may not have scientific training to be able to create analysis that can be picked up by somebody else who can then reproduce your results. That, and creating reproducible research is just solid science.
We are going to spend an evening walking through the various tools available to create reproducible research on Big Data. You will get introduced to the Tidyverse of R packages and how to use them. We will discuss the ins and outs of various notebook technologies like Jupyter and Zeppelin. You will have an opportunity to learn how to get up and running with R and Spark and the various options you have to learn on real clusters instead of just your local environment. There will also be a quick introduction to source control and the various options you have around using Git.
The theme of the evening will be “getting started”. We will go over various training resources and show you the optimal path to go from zero to master. Some commentary will be provided around the current state of the job market and intel from the front lines of the data science language wars. This is a large topic and the evening will be fairly dynamic and responsive to the needs of the audience.
Bob Wakefield has spent the better part of 16 years building data systems for many organizations across various industries. He has been running Hadoop in a lab environment for 3 years. He is the principal of Mass Street Analytics, LLC, a boutique data consultancy. Mass Street is a Hortonworks Consultant Partner and Confluent Partner.
In his spare time, he likes to work on an equity investment application that combines various sources of information to automatically arrive at investing decisions. When he is not doing that, you’ll find him flying his A-10 simulator. Full CV can be found here: https://www.linkedin.com/in/bobwakefieldmba/
Product Management in the Era of Data Science - Mandar Parikh
My slide-deck from a webinar on the same topic for the Institute of Product Leadership, April 4th, 2017
What does it take to build killer products in the “AI-first” era? What makes for a great Data Science-driven product and how do great Product Managers leverage Data Science to drive value for customers? Find out how to avoid the pitfalls of hype-chasing Data Science tactics. Learn how to work with Data Science and Engineering to build a compelling product and solve real problems.
Mandar takes a practitioner’s approach to present his recipe for success for building Data Science-driven products that drive enduring value for customers.
How to effectively deliver Data Science projects. This presentation is aimed at improving collaboration and communication between data science and data engineering. With Agile discipline, we could further improve the process of incremental value delivery.
JavaZone 2018 - A Practical(ish) Introduction to Data Science - Mark West
Code: https://github.com/markwest1972/titanic
Video: https://vimeo.com/289705893
Data Science has been described as the sexiest job of the 21st Century. But what is Data Science? And what has Machine Learning got to do with all of this?
In this talk I will share insights and knowledge that I have gained from building up a Data Science department from scratch. This talk will be split into three sections:
1. I’ll begin by defining what Data Science is, how it is related to Machine Learning and share some tips for introducing Data Science to your organisation.
2. Next up we’ll run through some commonly used Machine Learning algorithms used by Data Scientists, along with examples for use cases where these algorithms can be applied.
3. The final third of the talk will be a demonstration of how you can quickly get started with Data Science and Machine Learning using Python and the Open Source scikit-learn Library.
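As a hedged sketch of what step 3 of such a talk typically starts from, here is a minimal scikit-learn workflow: load a bundled dataset, split it, fit a model, and score it on held-out data. The dataset, model choice, and split parameters are illustrative assumptions, not the talk's actual demo (that code lives in the linked GitHub repo):

```python
# Minimal scikit-learn "getting started" workflow: data -> split -> fit -> score.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load a small bundled dataset (150 iris flowers, 3 species).
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows so we can measure generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit a simple, interpretable classifier on the training rows only.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate on rows the model has never seen.
accuracy = accuracy_score(y_test, model.predict(X_test))
```

The same four-step shape (load, split, fit, score) carries over to essentially every estimator in the library, which is what makes it a good first demo.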
Lean Analytics is a set of rules to make data science more streamlined and productive. It touches on many aspects of what a data scientist should be and how a data science project should be defined to be successful. During this presentation Richard will present where data science projects go wrong, how you should think of data science projects, what constitutes success in data science and how you can measure progress. This session will be loaded with terms, stories and descriptions of project successes and failures. If you're wondering whether you're getting value out of data science, how to get more value out of it and even whether you need it then this talk is for you!
What you will take away from this session
Learn how to make your data science projects successful
Evaluate how to track progress and report on the efficacy of data science solutions
Understand the role of engineering and data scientists
Understand your options for processes and software
NDC Oslo: A Practical Introduction to Data Science - Mark West
Data Science has been described as the sexiest job of the 21st Century. But what is Data Science? And what has Machine Learning got to do with all this?
In this talk I will share insights and knowledge that I have gained from building up a Data Science department from scratch. This talk will be split into three sections:
(1) I’ll begin by defining what Data Science is, how it is related to Machine Learning and share some tips for introducing Data Science to your organisation.
(2) Next up we’ll run through some commonly used Machine Learning algorithms used by Data Scientists, along with examples for use cases where these algorithms can be applied.
(3) The final third of the talk will be a demonstration of how you can quickly get started with Data Science and Machine Learning using Python and the Open Source scikit-learn Library.
Highlights and summary of long-running programmatic research on data science: practices, roles, tools, skills, organization models, workflow, outlook, etc. Profiles and persona definition for the data scientist model. Landscape of org models for data science and drivers for capability planning. Secondary research materials.
There's a new breed of digital marketers who are employing data science practices to achieve better results more efficiently. Whether it's SEO, email marketing, marketing automation, response conversion, funnel optimization, or web analytics, these hybrid data scientists / data-driven digital marketers are using advanced mathematics, machine learning, predictive analytics, statistical modeling, and data mining in conjunction with marketing savvy to drive better results and grow companies faster. We'll discuss some case studies, best practices, and simple pseudo-data-science actions that even the non-data scientist can put to use immediately.
Data science is a multidisciplinary field that combines scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It involves analyzing, interpreting, and deriving actionable information from large and complex datasets to support decision-making and solve problems in various domains.
Key components of data science include:
Data Collection and Preparation: Data scientists gather and collect data from various sources, which may include databases, websites, sensors, social media, or other digital platforms. They clean, transform, and preprocess the data to ensure its quality and suitability for analysis.
Data Exploration and Visualization: Data scientists explore and visualize the data using statistical techniques and visualization tools. They look for patterns, trends, and relationships within the data to gain a deeper understanding of the underlying insights and potential correlations.
Machine Learning and Predictive Modeling: Data scientists apply machine learning algorithms and predictive modeling techniques to build models that can make predictions or classifications based on the available data. This involves training models on historical data and evaluating their performance on new or unseen data.
Statistical Analysis: Statistical analysis is a fundamental aspect of data science. Data scientists use statistical methods to analyze data, test hypotheses, identify significant variables, and quantify uncertainties to make informed decisions.
Data Interpretation and Communication: Data scientists interpret the results of their analysis and communicate their findings to stakeholders in a clear and meaningful way. They use data visualization techniques, storytelling, and data-driven insights to convey complex information and facilitate decision-making.
Domain Knowledge: Data scientists often work in specific domains or industries and require domain knowledge to understand the context and interpret the results effectively. This allows them to identify relevant variables, apply appropriate techniques, and generate actionable insights.
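The train-on-historical, evaluate-on-unseen loop described under "Machine Learning and Predictive Modeling" can be sketched in a few lines of plain Python; the labels and the majority-class baseline below are invented purely for illustration:

```python
# Sketch of the train/evaluate split: "fit" a trivial majority-class model
# on historical labels, then measure its accuracy on labels it never saw.
from collections import Counter

historical = ["spam", "ham", "spam", "spam", "ham", "spam"]   # training labels
unseen     = ["spam", "spam", "ham", "spam"]                  # held-out labels

# "Training": the majority-class baseline just memorises the most common label.
majority_label = Counter(historical).most_common(1)[0][0]

# "Evaluation": fraction of held-out labels the baseline predicts correctly.
accuracy = sum(label == majority_label for label in unseen) / len(unseen)
```

Real models replace the majority-class rule with something learned from features, but the discipline is the same: the score that matters is the one computed on data excluded from training.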
Data science has applications across various sectors, including finance, healthcare, marketing, retail, telecommunications, and more. It helps organizations gain a competitive advantage, optimize processes, identify trends, improve customer experiences, and drive data-informed decision-making.
To work in data science, proficiency in programming languages (such as Python or R), statistical knowledge, data manipulation skills, and experience with machine learning algorithms are typically required. Data scientists also need critical thinking, problem-solving abilities, and effective communication skills to effectively analyze data and communicate insights to both technical and non-technical stakeholders.
Data Security and Protection in DevOps - Karen Lopez
Presentation to the London #WinOps event, Sept 2019, focusing on data security, privacy, and protection in DevOps efforts. Includes data masking, dev and test data, Always Encrypted, and more.
Modernizing, Migrating & Mitigating - Moving to Modern Cloud & API Web Apps W... - Security Innovation
This talk will help you, as a decision maker or architect, to understand the risks of migrating a thick client or traditional web application to the modern web. In this talk I’ll give you tools and techniques to make the migration to the modern web painless and secure so you can mitigate common pitfalls without having to make the mistakes first. I’ll be doing demos, and telling lots of stories throughout.
Making some good architectural decisions up front can help you:
- Minimize the risk of data breach
- Protect your users' privacy
- Make secure choices the easy default for your developers
- Understand the cloud security model
- Create defaults, policies, wrappers, and guidance for developers
- Detect when developers have bypassed security controls
The Next Generational Shift In Enterprise Infrastructure Has Arrived. The report is also available at https://www.scribd.com/document/352452857/2017-Enterprise-Almanac
AI to Enable Next Generation of People Managers - Work-Bench
In our work with hundreds of top fast growth startups and globally-distributed Fortune 1000 corporations in our enterprise tech ecosystem here in NYC, the most common refrain we hear is: "managing people is hard."
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, AI, big data, real-time systems, robots, and Milvus.
A lively discussion with NJ Gen AI Meetup lead Prasad and Procure.FYI's Co-Founder.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group (“MCG”) expects demand to grow and supply to evolve, facilitated through institutional investment rotating out of offices and into work from home (“WFH”) arrangements, while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/