Implementing a data_science_project (Python Version)_part1 (Dr Sulaimon Afolabi)
This teaches how to implement a data science project using Python.
You can watch the YouTube video via this link: https://goo.gl/Mi4aJH
Jupyter notebook: https://goo.gl/AxRMe3
TestGuild and QuerySurge Presentation - DevOps for Data Testing (RTTS)
This slide deck is from one of our four webinars in our half-day series in conjunction with Test Guild.
Chris Thompson and Mike Calabrese, Senior Solution Architects and QuerySurge experts, provide great information, a demo, and plenty of humor in this webinar on how to implement DevOps for Data in your DataOps pipeline.
To watch the video, go to:
https://youtu.be/1ihuRPgY_rs
Full Stream Ahead: Authoring Workflows for Scalable Stream Processing (Safe Software)
Data streams are commonly defined as data that is continuously generated by many different sources, which typically send their records simultaneously and in small sizes.
Despite lots of data being produced, not everyone knows how to extract value from these streams. With FME, this process is made easier than ever.
During this hour-long webinar, we’ll show you just how easy it is to get value out of data streams without having to hire a programming team. After a quick introduction to the world of stream processing, we will go through several scenarios to demonstrate, including:
- Filtering high volume streams
- Time windowing
- Group-based stream processing
- Advanced windowing & dynamic geofences
After this webinar, you’ll be full stream ahead with your data where and when you need it in no time.
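FME handles all of this without code; for readers who want the underlying idea, here is a minimal plain-Python sketch of what a tumbling time window does (all names and data are illustrative, not FME's API):

```python
from collections import defaultdict

def tumbling_windows(events, window_seconds):
    """Group (timestamp, value) events into fixed, non-overlapping
    time windows and collect the values in each window.

    events: iterable of (epoch_seconds, value) pairs.
    Returns {window_start: [values...]} sorted by window start.
    """
    windows = defaultdict(list)
    for ts, value in events:
        window_start = ts - (ts % window_seconds)  # floor to window boundary
        windows[window_start].append(value)
    return dict(sorted(windows.items()))

# A small simulated stream: sensor readings over 25 seconds.
stream = [(0, 1.0), (4, 2.0), (11, 3.0), (12, 5.0), (24, 7.0)]
per_window = tumbling_windows(stream, window_seconds=10)
# Three 10-second windows: [0, 10), [10, 20), [20, 30)
counts = {start: len(vals) for start, vals in per_window.items()}
```

Filtering a high-volume stream is then just a predicate applied before windowing; group-based processing partitions the stream by a key first and windows each group separately.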
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide (Databricks)
The traditional approach to insurance pricing involves fitting a generalized linear model (GLM) to data collected on historical claims payments and premiums received. The explosive growth in data availability and increasing competitiveness in the marketplace are challenging actuaries to find new insights in their data and make predictions with more granularity, improved speed and efficiency, and with tighter integration among business units to support strategic decisions.
In this session we will share our experience implementing deep hierarchical neural networks using TensorFlow and PySpark on Databricks. We will discuss the benefits of the ML Runtime, our experience using the goofys mount, our process for hyperparameter tuning, specific considerations for the large dataset size and extreme volatility present in insurance data, among other topics.
Authors: Bryn Clark, Krish Rajaram
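As a toy illustration of the GLM baseline the session starts from (not Nationwide's model; all names and figures here are made up), an intercept-only Poisson frequency model with an exposure offset reduces to a closed-form rate estimate, and pricing is frequency times severity:

```python
def poisson_rate_mle(claim_counts, exposures):
    """Closed-form MLE of claim frequency for an intercept-only
    Poisson GLM with a log link and an exposure offset:
    rate = total claims / total exposure (in policy-years)."""
    return sum(claim_counts) / sum(exposures)

def pure_premium(rate, avg_severity):
    """Expected annual claim cost per unit of exposure:
    frequency x severity."""
    return rate * avg_severity

claims = [0, 1, 0, 2, 0, 1]               # claims per policy (toy data)
years = [1.0, 1.0, 0.5, 2.0, 1.0, 0.5]    # policy-years of exposure
rate = poisson_rate_mle(claims, years)    # 4 claims over 6 policy-years
premium = pure_premium(rate, avg_severity=3000.0)
```

A full GLM adds rating factors (age, territory, vehicle class) as covariates; the neural-network approach in the session replaces the linear predictor with a deep hierarchical model.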
Related videos:
- Truth about Supply Demand Planning: http://www.youtube.com/watch?v=K66q2o1ED3c
- Demantra Vs Oracle Demand Planning: http://www.youtube.com/watch?v=QwAzP3T6ut4
Another SlideShare PPT: http://www.slideshare.net/amitforu78/demantra-vs-oracle-demand-planning
Contact me at www.ezdia.com
A fresh new experience
Project offers a redesigned user experience that is simple and intuitive. Teams can quickly add new members and set up tasks, and then easily switch between grids, boards, or timeline (Gantt) charts to track progress. And because Project is part of the Microsoft 365 family, project teams can save time and do more with built-in connections to familiar apps like Microsoft Teams and Office.
Collaboration made easy
Designed to do much more than just track progress, Project works with Teams to support collaboration and make it easy to manage all aspects of a team project, including file sharing, chats, meetings, and much more. Team members in scattered locations can even edit tasks simultaneously, so they can get more done together, no matter where they are. To help teams stay on track, Project offers an automated scheduling engine based on effort, duration, and resources.
Dataiku - Predictive Application to Production, PAPis May 2015 (Dataiku)
Beyond Predictive Analytics: Deploying apps to production and keeping them improving
Some smart companies have been putting predictive applications in production for decades. Still, whether because of a lack of sharing or a lack of generality, there is still no single, obvious way to put a predictive application in production today.
As a consequence, for most companies, transitioning analytics from development to production is still “the next frontier”.
Behind the single word "production” lies a great number of questions, like: what exactly do you put in production: data, model, code, or all three? Who is responsible for maintenance and quality checks over time: business, tech, or both? How can I make my predictive app continuously improve, and check that it delivers the promised business value over time? What are the best practices for maintenance and updates, by the way? Will my data scientists keep working after the first development, or should I lay half of them off? Etc.
Let’s make a small analogy with the development of web sites in the ’90s and early ’00s:
Back then, the winners were not necessarily the web sites with an amazing design, but a winner had clearly made the necessary effort and had a robust way to put their web site reliably in production.
Today, every web developer can enjoy the comfort of Heroku, Amazon, GitHub, Docker, Angular, Bootstrap … and so we forget. How much time before we get the same comfort for the predictive world?
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...) (Amazon Web Services)
Companies have valuable data that they might not be analyzing due to the complexity, scalability, and performance issues of loading the data into their data warehouse. With the right tools, you can extend your analytics to query data in your data lake—with no loading required. Amazon Redshift Spectrum extends the analytic power of Amazon Redshift beyond data stored in your data warehouse to run SQL queries directly against vast amounts of unstructured data in your Amazon S3 data lake. This gives you the freedom to store your data where you want, in the format you want, and have it available for analytics when you need it. Join a discussion with an Amazon Redshift lead engineer to ask questions and learn more about how you can extend your analytics beyond your data warehouse.
Wolfgang Epting – IT-Tage 2015 – Testdaten – versteckte Geschäftschance oder ... (Informatik Aktuell)
High-quality test data of the right size and composition, at the right time and in the right place, demonstrably improves application quality, reduces the error rate in production, increases the agility of application development, and thus saves considerable cost. But which developer or tester wants to do something effectively forbidden in the course of their work when they come into contact with personal data? This is why clear guidelines and standards, combined with suitable tools, are needed to prevent possible violations of the German Federal Data Protection Act (Bundesdatenschutzgesetz).
Scylla Summit 2017: Stateful Streaming Applications with Apache Spark (ScyllaDB)
When working with streaming data, stateful operations are a common use case. If you would like to perform data de-duplication, calculate aggregations over event-time windows, or track user activity over sessions, you are performing a stateful operation.
Apache Spark provides users with a high level, simple to use DataFrame/Dataset API to work with both batch and streaming data. The funny thing about batch workloads is that people tend to run these batch workloads over and over again. Structured Streaming allows users to run these same workloads, with the exact same business logic in a streaming fashion, helping users answer questions at lower latencies.
In this talk, we will focus on stateful operations with Structured Streaming and we will demonstrate through live demos, how NoSQL stores can be plugged in as a fault tolerant state store to store intermediate state, as well as used as a streaming sink, where the output data can be stored indefinitely for downstream applications.
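The de-duplication case above can be sketched in a few lines of plain Python, with a dict standing in for the pluggable, fault-tolerant state store the talk demonstrates (names are hypothetical, not Spark's API):

```python
class DictStateStore:
    """Stand-in for an external state store (the talk plugs in a
    NoSQL store such as Scylla here for fault tolerance)."""
    def __init__(self):
        self._seen = {}

    def contains(self, key):
        return key in self._seen

    def put(self, key, value):
        self._seen[key] = value

def deduplicate(events, store):
    """Stateful streaming de-dup: emit each event id only the first
    time it is seen, recording what was seen in the state store."""
    out = []
    for event_id, payload in events:
        if not store.contains(event_id):
            store.put(event_id, payload)
            out.append((event_id, payload))
    return out

# Simulated stream with duplicate deliveries of events "a" and "b".
stream = [("a", 1), ("b", 2), ("a", 1), ("c", 3), ("b", 2)]
unique = deduplicate(stream, DictStateStore())
```

Because the store is behind a small interface, swapping the dict for a real NoSQL client changes only `DictStateStore`, which is the point the talk makes about pluggable state stores.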
Powering Real-Time Decisions with Continuous Data Streams (Safe Software)
In an era where making swift, data-driven decisions can set industry leaders apart, understanding the world of data streaming and stream processing is crucial. During this webinar, we'll explore:
- Stream Processing Overview: Dive into what stream processing entails and the value it brings organizations.
- Stream vs. Batch Processing: Learn the key differences and benefits of stream processing compared to traditional batch processing, highlighting the efficiency of real-time data handling.
- Mastering Data Volumes: Discover strategies for effectively managing both high and low volume data streams, ensuring optimal performance.
- Boosting Operational Excellence: Explore how adopting data streaming can enhance your organization's operational workflows and productivity.
- Spatial Data's Role in Streams: Understand the importance of spatial data in stream processing for more informed decision-making.
- Interactive Demos: Watch practical demos, from dynamic geofencing to group-based processing.
Plus, we’ll show you how you can do it without coding! Register now to take the first step towards more informed, timely, and precise decision-making for your organization.
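As a rough sketch of what a geofence check does under the hood (plain Python with illustrative coordinates, not FME's implementation), each position in the stream is tested against a circular fence:

```python
import math

EARTH_RADIUS_M = 6371000

def inside_geofence(point, center, radius_m):
    """Check whether a (lat, lon) point falls inside a circular
    geofence, using an equirectangular approximation (adequate for
    the short distances typical of geofencing)."""
    lat1, lon1 = map(math.radians, point)
    lat2, lon2 = map(math.radians, center)
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    distance_m = EARTH_RADIUS_M * math.hypot(x, y)
    return distance_m <= radius_m

# Simulated stream of vehicle positions; alert when one enters the fence.
fence_center = (49.1044, -122.8011)  # illustrative location
fence_radius_m = 500
positions = [(49.2827, -123.1207),   # far away: no alert
             (49.1050, -122.8020)]   # inside the fence: alert
alerts = [p for p in positions if inside_geofence(p, fence_center, fence_radius_m)]
```

A dynamic geofence is the same check with `fence_center` itself updated from another stream (e.g. a moving vehicle), which is what the advanced demo in the webinar shows.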
The presentation gives an overview of the reasons for implementing a Manufacturing Intelligence strategy and how to justify the investment. Topics covered include:
-Manufacturing Intelligence Overview
-Business Drivers for Implementing an MI project
-What Data are we looking for?
-Developing the Business Case
-Execution Strategies for Success
-Some Challenges
Understanding Multitenancy and the Architecture of the Salesforce Platform (Salesforce Developers)
Join us as we take a deep dive into the architecture of the Salesforce platform, explain how multitenancy actually works, and how it affects you as a developer. Showing the technology we use and the design principles we adhere to, you'll see how our platform teams manage three major upgrades a year without causing any issues to existing development. We'll cover the performance and security implications around the platform to give you an understanding of how limits have evolved. By the end of the session you'll have a better grasp of the architecture underpinning Force.com and understand how to get the most out of it.
Blueprint Series: Banking In The Cloud – Ultra-high Reliability Architectures (Matt Stubbs)
Data architecture for a challenger bank.
Speaker: Jason Maude, Head of Technology Advocacy, Starling Bank
Speaker Bio: Jason Maude is a coder, coach, and public speaker. He has over a decade of experience working in the financial sector, primarily in creating and delivering software. He is passionate about explaining complex technical concepts to those who are convinced that they won't be able to understand them. He currently works at Starling Bank as their Head of Technology Advocacy and host of the Starling podcast.
Filmed at Skills Matter/Code Node London on 9th May 2019 as part of the Big Data LDN Meetup Blueprint Series.
Meetup sponsored by DataStax.
Speed Up Your Apache Cassandra™ Applications: A Practical Guide to Reactive P... (Matt Stubbs)
Speaker: Cedrick Lunven, Developer Advocate, DataStax
Speaker Bio: Cedrick is a Developer Advocate at DataStax, where he finds opportunities to share his passions by speaking about developing distributed architectures and implementing reference applications for developers. In 2013, he created FF4j, an open source framework for Feature Toggle which he still actively maintains. He is now a contributor to the JHipster team.
Talk Synopsis: We have all introduced some functional programming and asynchronous operations into our applications in order to speed up and distribute processing (e.g., multi-threading, Future, CompletableFuture, etc.). To build truly non-blocking components, optimize resource usage, and avoid "callback hell", you have to think reactive: everything is an event.
From the frontend UI to database communications, it’s now possible to develop Java applications as fully reactive with frameworks like Spring WebFlux and Reactor. With high throughput and tunable consistency, applications built on top of Apache Cassandra™ fit perfectly within this pattern.
DataStax has been developing Apache Cassandra drivers for years, and in the latest version of the enterprise driver we introduced reactive programming.
During this session we will migrate, step by step, a vanilla CRUD Java service (SpringBoot / SpringMVC) into reactive with both code review and live coding. Bring home a working project!
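The blocking-versus-reactive idea translates beyond Java. As a rough Python asyncio analogy (illustrative names, not the talk's Spring WebFlux code), three simulated queries run concurrently rather than back to back, so total latency tracks the slowest call instead of the sum:

```python
import asyncio

async def fetch(name, delay):
    """Stand-in for a non-blocking I/O call (e.g. a reactive
    database query); names and delays are illustrative."""
    await asyncio.sleep(delay)
    return name

async def main():
    # Reactive style: all three "queries" are in flight at once,
    # so the total latency is roughly the max delay, not the sum.
    return await asyncio.gather(
        fetch("users", 0.05),
        fetch("orders", 0.05),
        fetch("items", 0.05),
    )

results = asyncio.run(main())
```

In the sequential (blocking) version, each `fetch` would be awaited before starting the next, tripling the latency; reactive frameworks like Reactor apply the same principle all the way down to the database driver.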
Filmed at Skills Matter/Code Node London on 9th May 2019 as part of the Big Data LDN Meetup Blueprint Series.
Meetup sponsored by DataStax.
More Related Content
Similar to Big Data LDN 2017: Matching and De-duping Big Data in the Cloud – in Minutes – Can It Be Done?
Blueprint Series: Expedia Partner Solutions, Data Platform (Matt Stubbs)
Join Anselmo for an engaging overview of the new end-to-end data architecture at Expedia Group, taking a journey through cloud and on-prem data lakes, real-time and batch processes and streamlined access for data producers and consumers. Find out how the new architecture unifies a complex mix of data sources and feeds the data science development cycle. Expedia might appear to be a market-leading travel company – in reality, it’s a highly successful technology and data science company.
Blueprint Series: Architecture Patterns for Implementing Serverless Microserv... (Matt Stubbs)
Richard Freeman talks about how the data science team at JustGiving built KOALA, a fully serverless stack for real-time web analytics capture, stream processing, metrics API, and storage service, supporting live data at scale from over 26M users. He discusses recent advances in serverless computing, and how you can implement traditionally container-based microservice patterns using serverless-based architectures instead. Deploying Serverless in your organisation can dramatically increase the delivery speed, productivity and flexibility of the development team, while reducing the overall running, DevOps and maintenance costs.
Big Data LDN 2018: DATABASE FOR THE INSTANT EXPERIENCE (Matt Stubbs)
Date: 14th November 2018
Location: Customer Experience Theatre
Time: 12:30 - 13:00
Speaker: David Maitland
Organisation: Redis Labs
About: This session will cover the technology underpinnings, at the software infrastructure level, required to deliver the instant experience to end users and enterprises alike. Use cases and value derived by major brands will be shared in this insightful session based on the world's most loved database, Redis.
Big Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQL (Matt Stubbs)
Date: 14th November 2018
Location: Customer Experience Theatre
Time: 11:50 - 12:20
Speaker: Perry Krug
Organisation: Couchbase
About: Who wants to see an ad today for the shoes they bought last week? Everyone knows that customer experience is driven by data: don't waste an opportunity to get them the right data at the right time. Real-time results are critical, but raw speed isn't everything: you need power and flexibility to react to changes on the fly. Come learn how market-leading enterprises are using Couchbase as their speed layer for ingestion, incremental view and presentation layers alongside Kafka, Spark and Hadoop to liberate their data lakes.
Big Data LDN 2018: ENABLING DATA-DRIVEN DECISIONS WITH AUTOMATED INSIGHTS (Matt Stubbs)
Date: 13th November 2018
Location: Customer Experience Theatre
Time: 11:50 - 12:20
Speaker: Charlotte Emms
Organisation: seenit
About: How do you get your colleagues interested in the power of data? This session takes you through Seenit’s journey using Couchbase's NoSQL database to create a regular, fully automated update in an easily digestible format.
Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI... (Matt Stubbs)
Date: 14th November 2018
Location: Governance and MDM Theatre
Time: 10:30 - 11:00
Speaker: Mike Ferguson
Organisation: IBS
About: For most organisations today, data complexity has increased rapidly. In the area of operations, we now have cloud and on-premises OLTP systems with customers, partners and suppliers accessing these applications via APIs and mobile apps. In the area of analytics, we now have data warehouse, data marts, big data Hadoop systems, NoSQL databases, streaming data platforms, cloud storage, cloud data warehouses, and IoT-generated data being created at the edge. Also, the number of data sources is exploding as companies ingest more and more external data such as weather and open government data. Silos have also appeared everywhere as business users are buying in self-service data preparation tools without consideration for how these tools integrate with what IT is using to integrate data. Yet new regulations are demanding that we do a better job of governing data, and business executives are demanding more agility to remain competitive in a digital economy. So how can companies remain agile, reduce cost and reduce the time-to-value when data complexity is on the up?
In this session, Mike will discuss how companies can create an information supply chain to manufacture business-ready data and analytics to reduce time to value and improve agility while also getting data under control.
Date: 13th November 2018
Location: Governance and MDM Theatre
Time: 12:30 - 13:00
Organisation: Immuta
About: Artificial intelligence is rising in importance, but it’s also increasingly at loggerheads with data protection regimes like the GDPR—or so it seems. In this talk, Sophie will explain where and how AI and GDPR conflict with one another, and how to resolve these tensions.
Big Data LDN 2018: REALISING THE PROMISE OF SELF-SERVICE ANALYTICS WITH DATA ... (Matt Stubbs)
Date: 13th November 2018
Location: Governance and MDM Theatre
Time: 11:50 - 12:20
Speaker: Mark Pritchard
Organisation: Denodo
About: Self-service analytics promises to liberate business users to perform analytics without the assistance of IT, and this in turn promises to free IT to focus on enhancing the infrastructure.
Join us to learn how data virtualization will allow you to gain real-time access to enterprise-wide data and deliver self-service analytics. We will explore how you can seamlessly unify fragmented data, replace your high-maintenance, high-cost data integrations with a single, low-maintenance data virtualization layer, and preserve your data integrity while ensuring data lineage is fully traceable.
Big Data LDN 2018: TURNING MULTIPLE DATA LAKES INTO A UNIFIED ANALYTIC DATA L...Matt Stubbs
Date: 13th November 2018
Location: Governance and MDM Theatre
Time: 11:10 - 11:40
Organisation: TIBCO
About: The big data phenomenon continues to accelerate, resulting in multiple data lakes at most organisations. However, according to Gartner, “Through 2019, 90% of the information assets from big data analytic efforts will be siloed and unusable across multiple business processes.”
Are you ready to unleash this data from these silos and deliver the insights your organisation needs to drive compelling customer experiences, innovative new products and optimised operations? In this session you will learn how to apply data virtualisation to:
- Access, transform and deliver data from across your lakes, clouds and other data sources
- Empower a range of analytic users and tools with all the data they need
- Move rapidly to a modern and flexible data architecture for the long run
In addition, you will see a demonstration of data virtualisation in action.
Big Data LDN 2018: CONSISTENT SECURITY, GOVERNANCE AND FLEXIBILITY FOR ALL WO...Matt Stubbs
Date: 14th November 2018
Location: Data-Driven Ldn Theatre
Time: 12:30 - 13:00
Organisation: Cloudera
About: The growth of public cloud is reinforcing the need to think more carefully about taking a consistent approach to data governance as technology teams build out a flexible and agile infrastructure to meet the demands of the business.
Join this session to learn more about Cloudera's recommended approach for enterprise-grade security and governance and how to ensure a consistent framework across private, public and on-premises environments.
Big Data LDN 2018: MICROLISE: USING BIG DATA AND AI IN TRANSPORT AND LOGISTICSMatt Stubbs
Date: 14th November 2018
Location: Data-Driven Ldn Theatre
Time: 11:10 - 11:40
Organisation: Microlise
About: Microlise are a leading provider of technology solutions to the transport and logistics industry worldwide. Discover how, with over 400,000 connected assets generating billions of messages a day, Microlise is evolving its platform to bring real-time analytics to its customers to improve safety, security and efficiency outcomes.
Big Data LDN 2018: EXPERIAN: MAXIMISE EVERY OPPORTUNITY IN THE BIG DATA UNIVERSEMatt Stubbs
Date: 14th November 2018
Location: Data-Driven Ldn Theatre
Time: 10:30 - 11:00
Speaker: Anna Matty
Organisation: Experian
About: Today there is a widespread focus on the 'how' in relation to problem solving. How can we gain better knowledge of what consumers want, or need? How can we be more efficient, reduce the cost to serve, or grow the lifetime value of a customer? But how do you move to a place where you are not only solving a problem, but redesigning its entire strategic potential, armed with insight into what the problem really is?
Data and innovation offer huge potential to revolutionise all markets. There is an opportunity to be one step ahead of the need, to redesign journeys and enhance enterprise strategies. To do this you need access to the most advanced analytics, but also to the best-quality data in all its variations and types, and then to the technology that can act on this insight. Data science presents a unique opportunity for uncovering growth and accelerating your business through strategic innovation, fast. In this session you will hear how today's analytics can move from a single task to an ongoing strategic opportunity: one that helps you move at the speed of the market and maximise every opportunity.
Big Data LDN 2018: A LOOK INSIDE APPLIED MACHINE LEARNINGMatt Stubbs
Date: 13th November 2018
Location: Data-Driven Ldn Theatre
Time: 13:10 - 13:40
Speaker: Brian Goral
Organisation: Cloudera
About: The field of machine learning (ML) ranges from the very practical and pragmatic to the highly theoretical and abstract. This talk describes several of the challenges facing organisations that want to leverage more of their data through ML, including some examples of the applied algorithms that are already delivering value in business contexts.
Big Data LDN 2018: DEUTSCHE BANK: THE PATH TO AUTOMATION IN A HIGHLY REGULATE...Matt Stubbs
Date: 13th November 2018
Location: Data-Driven Ldn Theatre
Time: 12:30 - 13:00
Speaker: Paul Wilkinson, Naveen Gupta
Organisation: Cloudera
About: Investment banks are faced with some of the toughest regulatory requirements in the world. In a market where data is increasing and changing at extraordinary rates the journey with data governance never ends.
In this session, Deutsche Bank will share their journey with big data and explain some of the processes and techniques they have employed to prepare the bank for today’s challenges and tomorrow’s opportunities.
Brought to you by Naveen Gupta, VP Software Engineering, Deutsche Bank and Paul Wilkinson, Principal Solutions Architect, Cloudera.
Big Data LDN 2018: FROM PROLIFERATION TO PRODUCTIVITY: MACHINE LEARNING DATA ...Matt Stubbs
Date: 14th November 2018
Location: Self-Service Analytics Theatre
Time: 13:50 - 14:20
Speaker: Stephanie McReynolds
Organisation: Alation
About: Raw data is proliferating at an enormous rate. But so are our derived data assets - hundreds of dashboards, thousands of reports, millions of transformed data sets. With self-service analytics, this noise makes it increasingly hard to understand and trust data for decision-making. This trust gap is holding your organisation back from business outcomes.
European analytics leaders have found a way to close the gap between data and decision-making. From MunichRe to Pfizer and Daimler, analytics teams are adopting data catalogues for thousands of self-service analytics users.
Join us in this session to hear how data catalogues that activate data by incorporating machine learning can:
• Increase analyst productivity by 20-40%
• Boost understanding of the nuances of data
• Establish trust in data-driven decisions with agile stewardship
Big Data LDN 2018: DATA APIS DON’T DISCRIMINATEMatt Stubbs
Date: 13th November 2018
Location: Self-Service Analytics Theatre
Time: 15:50 - 16:20
Speaker: Nishanth Kadiyala
Organisation: Progress
About: The exploding API economy, combined with an advanced analytics market projected to reach $30 billion by 2019, is forcing IT to expose more and more data through APIs. Business analysts, data engineers, and data scientists are still not happy because their needs never really made it into the existing API strategies. This is because most APIs are designed for application integration, but not for the data workers who are looking for APIs that facilitate direct data access to run complex analytics. Data APIs are specifically designed to provide that frictionless data access experience to support analytics across standard interoperable interfaces such as OData (REST) or ODBC/JDBC (SQL). Consider expanding your API strategy to service the developers with open analytics in this $30 billion market.
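As a sketch of what such a data API looks like in practice, the helper below builds an OData-style request URL using the standard system query options ($select, $filter, $top) that let analysts pull only the columns and rows they need; the endpoint and entity names are invented for illustration.

```python
from urllib.parse import quote

def odata_query(base_url, entity, select=None, filter_=None, top=None):
    """Build an OData query URL; only the $-prefixed option names are
    standard OData, the endpoint below is hypothetical."""
    opts = []
    if select:
        opts.append("$select=" + quote(",".join(select)))   # column projection
    if filter_:
        opts.append("$filter=" + quote(filter_))            # row predicate
    if top is not None:
        opts.append("$top=" + str(top))                     # row limit
    return f"{base_url}/{entity}" + ("?" + "&".join(opts) if opts else "")

# A data worker pulling just what is needed for analysis:
url = odata_query("https://example.com/odata", "Sales",
                  select=["Region", "Amount"], filter_="Amount gt 1000", top=100)
```

The same projection/predicate/limit trio maps directly onto SQL over ODBC/JDBC, which is why both interfaces suit direct analytic data access.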
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps as just another data type. However, when performing real-time analytics, timestamps should be first-class citizens, and we need rich time semantics to get the most out of our data. We also need to deal with ever-growing datasets while staying performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open-source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
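As a small illustration of the rich time semantics mentioned above: QuestDB accepts SQL over a REST /exec endpoint, including its SAMPLE BY clause for time-window aggregation. The snippet below only builds the request URL; the trades table and its columns are hypothetical, and actually running the query would need a QuestDB instance listening on localhost:9000.

```python
from urllib.parse import urlencode

def questdb_exec_url(host, sql):
    """QuestDB's REST API runs SQL via GET /exec?query=<sql>."""
    return f"http://{host}:9000/exec?" + urlencode({"query": sql})

# SAMPLE BY is QuestDB's time-window aggregation: one average per hour,
# keyed on the table's designated timestamp column.
SQL = "SELECT ts, avg(price) FROM trades SAMPLE BY 1h"
URL = questdb_exec_url("localhost", SQL)
# import urllib.request; urllib.request.urlopen(URL)  # requires a running instance
```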
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large number of small workload submissions, and is expected to be a non-issue when the computation is performed on massive graphs.
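The decomposition step the abstract describes, splitting the graph into strongly connected components and grouping them into topological levels, can be sketched with Tarjan's SCC algorithm. This is only the levelwise block ordering; the rank computation itself is omitted.

```python
def tarjan_scc(graph):
    """Return the SCCs of {vertex: [out-neighbours]} in reverse topological order."""
    index, low, on_stack = {}, {}, set()
    stack, comps, counter = [], [], [0]

    def connect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                connect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            comps.append(comp)

    for v in graph:
        if v not in index:
            connect(v)
    return comps

def levelwise_blocks(graph):
    """Group SCCs into levels: a component's level is one more than the
    deepest component that links into it (level 0 = no upstream SCCs)."""
    comps = tarjan_scc(graph)
    comps.reverse()                      # now in topological order
    comp_id = {v: i for i, c in enumerate(comps) for v in c}
    level = [0] * len(comps)
    for i, comp in enumerate(comps):     # cross-edges always go to higher ids
        for v in comp:
            for w in graph.get(v, ()):
                j = comp_id[w]
                if j != i:
                    level[j] = max(level[j], level[i] + 1)
    return comps, level
```

Components within the same level have no edges between them, which is what allows them to be ranked without per-iteration communication.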
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
- Why do we need yet another (open-source) Copilot?
- How can we build one?
- Architecture and evaluation
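A minimal sketch of the retrieval step in the RAG approach described above, using a toy bag-of-words "embedding" in place of a real model; the document names and contents are invented stand-ins for a company's data-platform assets.

```python
import math
import re
from collections import Counter

# Invented snippets standing in for a company's data assets.
DOCS = {
    "orders_table": "orders table schema: order_id, customer_id, amount, created_at",
    "revenue_dashboard": "dashboard showing monthly revenue by region, refreshed nightly",
    "churn_model": "ml model predicting customer churn from product usage features",
}

def embed(text):
    """Toy bag-of-words vector; a real copilot would use a neural embedding model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, k=2):
    """The 'R' in RAG: rank documents by similarity to the question."""
    q = embed(question)
    return sorted(DOCS, key=lambda d: cosine(q, embed(DOCS[d])), reverse=True)[:k]

def build_prompt(question):
    """Augment the LLM prompt with the retrieved context."""
    context = "\n".join(DOCS[d] for d in retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

In a production copilot the retrieved context would typically be table schemas and documentation, so the LLM can generate SQL that is grounded in the actual data platform.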
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices, which share the same in-links, reduces duplicate computation and thus can also reduce iteration time. Road networks often contain chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easily calculated; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working with unstructured data. Speakers present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
3. Before we created Match2Lists, we needed to match millions of records of our customers' data and 3rd-party data. We ran a B2B consulting firm providing segmentation & data visualisation.
4. We saw too many false positives and 30%-40% missed matches: "Phoenix Ltd" vs "Fenix" came back as a fuzzy match (why?), while "GSK PLC" vs "GlaxoSmithKline Beecham" (met at a conference) came back as a fuzzy non-match (why not?). So we tried most fuzzy-logic software.
9. We developed more advanced data matching algorithms & approaches:
- Corroborative matching
- Iterative matching
- Contextual fuzzy logic
- Probabilistic logic
- Word order permutations
- Noise word elimination
- Character transformations
- Synonym analysis
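A few of these techniques, noise-word elimination, character cleanup, synonym analysis and a fuzzy similarity score, can be sketched with Python's standard-library difflib. The noise-word and synonym tables below are illustrative, not Match2Lists' own.

```python
import difflib
import re

NOISE = {"ltd", "plc", "inc", "llc", "limited", "company", "co"}
SYNONYMS = {"gsk": "glaxosmithkline"}   # illustrative synonym table

def normalise(name):
    """Noise-word elimination, character cleanup and synonym substitution."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    tokens = [SYNONYMS.get(t, t) for t in tokens if t not in NOISE]
    return " ".join(tokens)

def match_score(a, b):
    """Fuzzy similarity on the normalised names, in [0.0, 1.0]."""
    return difflib.SequenceMatcher(None, normalise(a), normalise(b)).ratio()
```

With the synonym table in place, "GSK PLC" and "GlaxoSmithKline" normalise to the same string and score a perfect match, while unrelated names score low.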
19.-20. De-duplicate data easily: the De-Dupe step groups records such as "Unilever Beteiligungs Gmbh", "Unilever N.V." and "Unilever Plc" (alongside "Ge Medical Systems Private Limited", "General Electric Company" and "Stichting Administratiekantoor").
22. Blend data from different sources: use Match2DnBMatch to merge your CRM customer data with D&B (Dun & Bradstreet) data and wallet-size data.
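Once each CRM record carries a matched D&B identifier, the blend step is essentially a keyed merge; a toy sketch follows, with every id and field invented for illustration.

```python
# Each CRM row already carries the D&B id assigned by the matching step.
crm_rows = [
    {"account": "Acme Ltd", "dnb_id": 101, "owner": "sales-uk"},
    {"account": "Globex",   "dnb_id": 102, "owner": "sales-us"},
]
dnb_by_id = {
    101: {"wallet_size": 250_000, "employees": 1_200},
    102: {"wallet_size": 900_000, "employees": 5_400},
}

def blend(rows, reference):
    """Keyed merge: enrich each CRM row with the matched reference fields."""
    return [{**row, **reference.get(row["dnb_id"], {})} for row in rows]

blended = blend(crm_rows, dnb_by_id)
```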
23. No technical skills required; anyone can use it: strategy analysts, sales & marketing, finance & operations.
24. Match2Lists' experience with EXASOL: a fast, smooth and faultless transition, with outstanding support and great teamwork.
- Disk-memory data exchange: despite the data compression, the data exchange between disk and memory is both efficient and rapid.
- Scripting functionality: an excellent scripting feature allows us to write our own User Defined Functions that run at high speed.
- Less memory = less cost: EXASOL required only 10% to 20% of the memory configuration of our previous solution when we ran both solutions in parallel during the transition phase.
- 5-minute reboot time: the ability to reboot Match2Lists in 5 minutes to perform system upgrades means practically no disruption for our customers.
- Speed and performance: data matching is 3X faster than our prior solution: 10 seconds to match 5 million records, 30 seconds to match 200 million records.
- Excellent data compression: impressive compression translates to lower hardware requirements; as customers and their data continue to grow, this is a key benefit.
27. The five-step workflow: 1. Upload your Data; 2. Select Project; 3. Preprocessing; 4. Review Matches; 5. Download Results.
28. Select Project: the lists available for matching.
CRM Data:
- SalesForce – CRM Account (02 Aug'16, USA, active, 168,287 records)
Subscriber Data:
- Addressable Market – Top 4000 Companies (11 Jul'16, USA, active, 11,827 records)
- MarTech – San Francisco Registrants (20 Mar'16, USA, active, 928 records)
Reference Data:
- Forbes – 2000 & Worldwide Subsidiaries (05 Jun'16, *G*, active, 434,230 records)
- Segmetrix Top 2500 by Wallet Size (20 Aug'16, *G*, active, 2,500 records)
- Our Global Segment 500 Accounts (01 May'16, *G*, active, 500 records)
Partner Data:
- Channel Partner 1 – Sales Out (23 Jun'16, DEU, active, 18,231 records)
- Channel Partner 2 – Sales Out (15 Jun'16, DEU, active, 34,109 records)
Contact Lists:
- Rhetorik UK – 25K Sites (01 Sep'16, UK, active, 23,800 records)
- D&B Top Companies – Tech & Finance (18 Aug'16, UK, active, 890 records)
29. Preprocessing: check the auto-detected field types (Company ID, Address, etc.) and manually select field types from the menu where needed.
30.-31. The matching engine then applies:
- Corroborative matching
- Iterative matching
- Fuzzy logic only when applicable
- Probabilistic logic
- All word order permutations
- Noise word elimination
- Special character transformations
- Synonym analysis
32.-34. Review Matches: the Match Visualiser. Objective: maximise the match rate.
- 1st match setting: select the fields to use and set the similarity strengths (under 30 seconds).
- Click any score band to assess its results; if the results look good, approve entire score ranges (here, down to the 56% level).
- Run a 2nd match setting and approve its results: you've now approved 93%.
- Download the results.
35. Download Results: select which fields you want to download from each list. That's it, all done!
36. A worked example. The source list (Company Name, Address Fields 1-3, City/Region, Post/Zip Code, Country):
- Kantar Media, 26-30 Uxbridge Road, London, W5 2AU, UK
- Coley Porter Bell, 121-141 Westbourne Terrace, London, W2 6JR, UK
- Ogilvy Group (UK), 10 Cabot Square, Canary Wharf, London, E14 4QB, UK
- J Walter Thompson, 1 Knightsbridge Green, London, SW1X 7NW, UK
- GE Healthcare, Maynard Centre, Forest Farm, Wales, CF14 7YT, UK
- Whatman plc, Springfield Mill, J Whatman Way, S West, ME14 2LE, UK
- Amphenol Limited, Crown Industrial Estate, Priorswood Road, TA2 8QY, UK
- ASDA Stores Limited, Asda House, Southbank, Great Wilson St, N East, LS11 5AD, UK
- International Procurement & Logistics, Unit 1, Foxbridge Way, N East, WF6 1TN, UK
- Stationery Office (UK ltd), St Crispins, Duke Street, East, NR3 1PD, UK
- DHL Supply Chain, Witwood Common Lane, Witwood, N East, WF10 5QL, UK
Design your output file: select the fields you want from your source list, plus the fields of the matched records, e.g. Global Ultimate ID, Global Ultimate Parent Name, WW Emp, SIC Code, Site Name, Site Address 1/2/3, Site State/County, Site Post Code.
The matched output, grouped by Global Ultimate Company (HQ Country, WW Emp, SIC Code):
- Kantar Media, Coley Porter Bell, Ogilvy Group (UK) and J Walter Thompson match to WPP PLC (UK, 120,376, 2839)
- GE Healthcare, Whatman plc and Amphenol Limited match to General Electric Company (USA, 5,929, 5578)
- ASDA Stores Limited and International Procurement & Logistics match to Wal-Mart Stores, Inc. (USA, 180,339, 8079)
- Stationery Office (UK ltd) and DHL Supply Chain match to Deutsche Post AG (Germany, 6,313, 4669)
37. The same worked example repeated, with an Industry column shown in place of the SIC code.