The task of “data profiling”—assessing the overall content and quality of a data set—is a core aspect of the analytic experience. Traditionally, profiling was a fairly cut-and-dried task: load the raw numbers into a stat package, run some basic descriptive statistics, and report the output in a summary file or perhaps a simple data visualization. However, data volumes can be so large today that traditional tools and methods for computing descriptive statistics become intractable; even with scalable infrastructure like Hadoop, aggressive optimization and statistical approximation techniques must be used. In this talk Sean will cover technical challenges in keeping data profiling agile in the Big Data era. He will discuss both research results and real-world best practices used by analysts in the field, including methods for sampling, summarizing and sketching data, and the pros and cons of using these various approaches.
Sean is Trifacta’s Chief Technical Officer. He completed his Ph.D. at Stanford University, where his research focused on user interfaces for database systems. At Stanford, Sean led development of new tools for data transformation and discovery, such as Data Wrangler. He previously worked as a data analyst at Citadel Investment Group.
Solving DEBS Grand Challenge with WSO2 CEP - Srinath Perera
The DEBS Grand Challenge is an annual event in which different event-based systems compete to solve a real-world problem. The 2014 challenge is to demonstrate scalable real-time analytics using high-volume sensor data collected from smart plugs over a one-and-a-half-month period. This paper aims to show how a general-purpose, commercially available event-based system - the WSO2 Complex Event Processor (WSO2 CEP) - was used to solve this problem. We achieved 300K TPS with one node and nearly 1 million TPS with 4 nodes. In addition, we explore areas where we created extensions to the WSO2 CEP engine to better solve the challenge.
Presentation by Peter Fontaine, CBO's Assistant Director for Budget Analysis, to a Global Network of Parliamentary Budget Offices Community Meeting Sponsored by the World Bank Institute
Vendor-neutral presentation about the common functionality provided by data profiling tools, which can help automate some of the work needed to begin your preliminary data analysis.
Data visualization has enabled us to compress data and express them visually in many interesting new ways. It is often said that we are trying to tell stories through them. Is that really the case? How can we ensure that the audience is able to retain, recall and retell our data-driven stories?
Using examples and videos from different storytelling mediums, I walk through why stories are important and what we can learn from these mediums about how stories work. I then detail a framework built on the See the data | Show the Visual | Tell the Story | Engage the Audience paradigm to convert the data into a data-visual-story.
This slide deck was used at the Bangalore Meetup - Crafting Visual Stories with Data - in March 2014 at InMobi's Bangalore office.
Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies.
Data profiling deserves a fresh look for several reasons: First, the area itself is neither established nor defined in any principled way, despite significant research activity on individual parts in the past. Second, current data profiling techniques hardly scale beyond what can only be called small data. Finally, more and more data beyond traditional relational databases are being created and beg to be profiled. The talk proposes new research directions and challenges, including interactive and incremental profiling and profiling heterogeneous and non-relational data.
Speaker: Felix Naumann studied mathematics, economics, and computer science at the Technical University of Berlin. After receiving his diploma (MA) in 1997 he joined the graduate school "Distributed Information Systems" at Humboldt University of Berlin. He completed his PhD thesis on "Quality-driven Query Answering" in 2000. In 2001 and 2002 he worked at the IBM Almaden Research Center on topics around data integration. From 2003 to 2006 he was assistant professor for information integration at Humboldt University of Berlin. Since then he has held the chair for information systems at the Hasso Plattner Institute at the University of Potsdam in Germany.
A Production Quality Sketching Library for the Analysis of Big Data - Databricks
In the analysis of big data there are often problem queries that don’t scale because they require huge compute resources to generate exact results, or don’t parallelize well.
Practical deep learning for computer vision - Eran Shlomo
This is the presentation given at TLV DLD 2017. In it, we walk through the planning and implementation of a deep learning solution for image recognition, with a focus on the data.
It is based on the work we do at dataloop.ai and with its customers.
With datasets becoming larger, new visualization techniques are needed. One problem is that the number of data points is so large that a scatter plot becomes cluttered. Another problem is that with over a billion objects, only a few CPU cycles are available per object if one wants to process them within a second, making traditional methods not viable.
I will show that it is possible to visualize a billion objects in about 1 second on a modern desktop computer using memory mapping of HDF5 files together with a simple binning algorithm in Python. Density fields in 1, 2, and 3D are computed, and statistics of any properties of the objects, or any mathematical operation on them, can be done on the fly. This enables efficient interactive exploration of large datasets, making science exploration feasible.
This idea is implemented in a Python library called vaex, which integrates well into the Jupyter/NumPy/Astropy stack. Built on top of this is the vaex application, which allows for interactive exploration and visualization.
The motivation for vaex is the upcoming astronomical catalogue of the Gaia satellite, which will contain properties of over a billion stars. However, vaex can also be used on N-body simulations, other catalogues or any tabular data.
https://www.astro.rug.nl/~breddels/vaex
https://github.com/maartenbreddels/vaex
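The core of this approach is simple enough to sketch in a few lines of NumPy (an illustrative sketch, not vaex's actual code; the file names and grid size are made up): memory-map the columns so the OS pages data in on demand, then bin the points into a density grid.

import numpy as np
import matplotlib.pyplot as plt

# Memory-map two columns of a large dataset (hypothetical file names);
# np.memmap reads pages on demand instead of loading everything into RAM.
x = np.memmap("x.float64", dtype=np.float64, mode="r")
y = np.memmap("y.float64", dtype=np.float64, mode="r")

# Bin the points onto a 256x256 grid: each cell counts the objects that
# fall into it, turning a billion scatter points into a density field.
density, xedges, yedges = np.histogram2d(x, y, bins=256)

# Plot log-density so dense regions don't wash out sparse ones.
plt.imshow(np.log1p(density.T), origin="lower", cmap="viridis")
plt.show()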
From USENIX LISA 2010, San Jose.
Visualizations that include heat maps can be an effective way to present performance data: I/O latency, resource utilization, and more. Patterns can emerge that would be difficult to notice from columns of numbers or line graphs, revealing previously unknown behavior. These visualizations are used in a product as a replacement for traditional metrics such as %CPU and allow end users to identify many more issues much more easily (some issues are nearly impossible to identify with tools such as vmstat(1)). This talk covers what has been learned, crazy heat map discoveries, and thoughts for future applications beyond performance analysis.
Make Life Suck Less (Building Scalable Systems) - guest0f8e278
This presentation was given at LinkedIn. It is a collection of guidelines and wisdom for re-thinking how we do engineering for massively scalable systems. Useful for anyone who cares about Big Data, Distributed Computing, Hadoop, and more.
Nearest neighbor models are conceptually just about the simplest kind of model possible. The problem is that they generally aren’t feasible to apply. Or at least, they weren’t feasible until the advent of Big Data techniques. These slides will describe some of the techniques used in the knn project to reduce thousand-year computations to a few hours. The knn project uses the Mahout math library and Hadoop to speed up these enormous computations to the point that they can be usefully applied to real problems. These same techniques can also be used to do real-time model scoring.
This slide deck explains how best to deal with state in scalable systems, i.e., pushing it to the system boundaries (client, data store) and avoiding state in between.
It then picks two scenarios - one in the frontend part and one in the backend part of a system - and shows concrete techniques for dealing with them.
The frontend part examines how to deal with the session state of servlet containers in scalable scenarios and introduces the concept of a shared session cache layer. An example implementation using Redis is also shown.
The backend part examines how to deal with potential data inconsistencies that can occur if maximum availability of the data store is required and eventual consistency is used. The usual way is to resolve inconsistencies manually by implementing business-specific logic or - even worse - asking the user to resolve them. A purely technical solution called CRDTs (Conflict-free Replicated Data Types) is then shown. CRDTs, based on sound mathematical concepts, are self-stabilizing data structures that offer a generic way to resolve inconsistencies in an eventually consistent data store. Besides some theory, a few examples are shown to give a feeling for how CRDTs behave in practice.
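To make the CRDT idea concrete, here is a minimal Python sketch of one of the simplest CRDTs, a grow-only counter (G-Counter); this is a generic illustration with made-up names, not code from the deck.

class GCounter:
    """Grow-only counter CRDT: each replica increments only its own slot,
    so concurrent updates never conflict."""
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count
    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n
    def value(self):
        return sum(self.counts.values())
    def merge(self, other):
        # Element-wise max is commutative, associative and idempotent,
        # so replicas converge regardless of the order messages arrive in.
        for rid, cnt in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), cnt)

# Two replicas update independently, then reconcile:
a, b = GCounter("a"), GCounter("b")
a.increment(3); b.increment(2)
a.merge(b); b.merge(a)
assert a.value() == b.value() == 5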
Approximate "Now" is Better Than Accurate "Later"NUS-ISS
How does Twitter track the top trending topics?
How does Amazon keep track of the top-selling items for the day?
How many cabs have been booked this month using your App?
Is the password that a new user is choosing a common/compromised password?
Modern web-scale systems process billions of transactions and generate terabytes of data every single day. To find answers to questions against this data, one would initiate a multi-minute query against a NoSQL datastore or kick off a batch job written in a distributed processing framework such as Spark or Flink. However, these jobs are throughput-heavy and not suited for real-time low-latency queries. Yet you and your customers would like to have all this information "right now".
At the end of this talk, you'll realize that you can power these low-latency queries with an incredibly low memory footprint "IF" you are willing to accept answers that are, say, 96-99% accurate. This talk introduces some of the go-to probabilistic data structures used by organisations with large amounts of data - specifically the Bloom filter, Count-Min Sketch and HyperLogLog.
Speaker: Oscar Westra van Holthe - Kind
Genre & level: Backend, Medior
Data congestion slows down even the fastest database. Want to avoid getting your data jammed? I’ll show you how you can turn your database into an information highway. Get your recipes for query performance improvements here.
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta - huguk
As Hadoop became mainstream, the need to simplify and speed up analytics processes grew rapidly. Data wrangling emerged as a necessary step in any analytical pipeline, and is often considered to be its crux, taking as much as 80% of an analyst's time. In this presentation we will discuss how data wrangling solutions can be leveraged to streamline, strengthen and improve data analytics initiatives on Hadoop, including use cases from Trifacta customers.
Bio: Olivier is EMEA Solutions Lead at Trifacta. He has 7 years of experience in analytics, with prior roles as technical lead for business analytics at Splunk and quantitative analyst at Accenture and Aon.
Stephen Taylor is the community manager for Ether Camp. They provide an analysis tool for the Ethereum blockchain, 'Block Explorer', and also an 'Integrated Development Environment' (IDE) that empowers developers to build, test and deploy applications in a sandbox environment. This November they are launching their second annual hackathon, hack.ether.camp, which aims to deliver a more sustained approach to the hackathon ideology by utilising blockchain technology.
Similar to Sean Kandel - Data profiling: Assessing the overall content and quality of a data set
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop - huguk
At Google Cloud Platform, we're combining the Apache Spark and Hadoop ecosystem with our software and hardware innovations. We want to make these awesome tools easier, faster, and more cost-effective, from 3 to 30,000 cores. This presentation will showcase how Google Cloud Platform is innovating with the goal of bringing the Hadoop ecosystem to everyone.
Bio: "I love data because it surrounds us - everything is data. I also love open source software, because it shows what is possible when people come together to solve common problems with technology. While they are awesome on their own, I am passionate about combining the power of open source software with the potential unlimited uses of data. That's why I joined Google. I am a product manager for Google Cloud Platform and manage Cloud Dataproc and Apache Beam (incubating). I've previously spent time hanging out at Disney and Amazon. Beyond Google, love data, amateur radio, Disneyland, photography, running and Legos."
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox... - huguk
This talk will describe his research into using Hadoop to query and manage big geographic datasets, specifically OpenStreetMap (OSM). OSM is an "open-source" map of the world, growing at a rapid rate and currently around 5TB of data. The talk will introduce OSM, detail some aspects of the research, and also discuss his experiences with using the SpatialHadoop stack on Azure and Google Cloud.
Extracting maximum value from data while protecting consumer privacy. Jason ... - huguk
Big organisations have a wealth of rich customer data which opens up huge new opportunities. However, they have the challenge of how to extract value from this data while protecting the privacy of their individual customers. He will talk about the risks organisations face, and what they should do about it. He will survey the techniques which can be used to make data safe for analysis, and talk briefly about how they are solving this problem at Privitar.
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson - huguk
IBM is developing the Watson Ecosystem to leverage its Developer Cloud, APIs, Content Store and Talent Hub. This is part of IBM's recent announcement of the $1B investment in Watson as a new business unit including Silicon Alley NYC headquarters. For the first time, IBM will open up Watson as a development platform in the Cloud to spur innovation and fuel a new ecosystem of entrepreneurial software app providers who will bring forward a new generation of applications infused with Watson's cognitive computing intelligence.
In this talk about Apache Flink we will touch on three main things: an introductory look at Flink, a look under the hood, and a demo.
* In the introduction we will briefly look at the history of Flink and then go on to the API and different use cases. Here we will also see how it can be deployed in practice and what some of the pitfalls in a cluster setting can be.
* In the second section we will look at the streaming execution engine that lies at the heart of Flink. Here we will see what makes it tick and also what distinguishes it from other approaches, such as the mini-batch execution model.
* In the final section we will see a live demo of a fault-tolerant streaming job that performs analysis of the wikipedia edit-stream.
Ufuk Celebi - PMC member at Apache Flink and co-founder and software engineer at data Artisans
Lambda architecture on Spark, Kafka for real-time large scale ML - huguk
Sean Owen – Director of Data Science @Cloudera
Building machine learning models is all well and good, but how do they get productionized into a service? It's a long way from a Python script on a laptop to a fault-tolerant system that learns continuously, serves thousands of queries per second, and scales to terabytes. The confederation of open source technologies we know as Hadoop now offers data scientists the raw materials from which to assemble an answer: the means to build models but also to ingest data and serve queries, at scale.
This short talk will introduce Oryx 2, a blueprint for building this type of service on Hadoop technologies. It will survey the problem and the standard technologies and ideas that Oryx 2 combines: Apache Spark, Kafka, HDFS, the lambda architecture, PMML, REST APIs. The talk will touch on a key use case for this architecture -- recommendation engines.
Today’s reality Hadoop with Spark - How to select the best Data Science approa... - huguk
Martin Oberhuber and Eliano Marques, Senior Data Scientists @Think Big International
In this talk Think Big International Lead Data Scientists will discuss the options that exist today for engineering and data science teams aiming to use big data patterns to solve new business problems. With the enterprise adoption of the Hadoop ecosystem and the emerging momentum of open source projects like Spark it is becoming mandatory to have an approach that solves for business results but remains flexible to adapt and change with the open source market.
Signal Media: Real-Time Media & News Monitoring - huguk
Startup pitch presented by CTO Wesley Hall. Signal Media is a real-time media and news monitoring platform that tracks media outlets. News items are analysed for brand & media monitoring as well as market intelligence.
Startup pitch presented by Aeneas Wiener. Cytora is a real-time geopolitical risk analysis platform that extracts events from open-source intelligence and evaluates these events on their geopolitical impact.
Startup pitch presented by co-founder and CEO Jaco Els. Cubitic offers a predictive analytics platform that allows developers to build custom solutions for analytics and visualisation on top of a machine learning engine.
Startup pitch presented by co-founder and CEO Corentin Guillo. Bird.i is building a platform for up-to-date earth observation data that will bring satellite imagery to the mass market. Providing fresh imagery together with analytics around the forecast of localised demand opens up innovative opportunities in sectors like construction, tourism, real-estate and remote facility monitoring.
Startup pitch presented by co-founders Laure Andrieux and Nic Greenway. Aiseedo applies real-time machine learning, where the model of the world is constantly updated, to build adaptive systems which can be applied to robotics, the Internet of Things and healthcare.
Secrets of Spark's success - Deenar Toraskar, Think Reactive - huguk
This talk will cover the design and implementation decisions that have been key to the success of Apache Spark over other competing cluster computing frameworks. It will delve into the whitepaper behind Spark and cover the design of Spark RDDs, the abstraction that enables the Spark execution engine to be extended to support a wide variety of use cases: Spark SQL, Spark Streaming, MLlib and GraphX. RDDs allow Spark to outperform existing models by up to 100x in multi-pass analytics.
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal... - huguk
Technical developments in the area of data warehousing have allowed companies to push their analysis a step further and, therefore, allowed data scientists to deliver more value to business areas. In that session, we will focus on the case of performance marketing at King and demonstrate how we use Hadoop capabilities to exploit user-level data efficiently. That approach results in obtaining a more holistic view in a return-on-investment analysis of TV advertisement.
Hadoop - Looking to the Future By Arun Murthy - huguk
Hadoop - Looking to the Future
By Arun Murthy (Founder of Hortonworks, Creator of YARN)
The Apache Hadoop ecosystem began as just HDFS & MapReduce nearly 10 years ago in 2006.
Very much like the Ship of Theseus (http://en.wikipedia.org/wiki/Ship_of_Theseus), Hadoop has undergone an incredible amount of transformation, from multi-purpose YARN to interactive SQL with Hive/Tez to machine learning with Spark.
Much more lies ahead: whether you want sub-second SQL with Hive or use SSDs/Memory effectively in HDFS or manage Metadata-driven security policies in Ranger, the Hadoop ecosystem in the Apache Software Foundation continues to evolve to meet new challenges and use-cases.
Arun C Murthy has been involved with Apache Hadoop since the beginning of the project - nearly 10 years now. In the beginning he led MapReduce, went on to create YARN and then drove Tez & the Stinger effort to get to interactive & sub-second Hive. Recently he has been very involved in the Metadata and Governance efforts. In between he founded Hortonworks, the first public Hadoop distribution company.
Search and Society: Reimagining Information Access for Radical Futures - Bhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Connector Corner: Automate dynamic content and events by pushing a button - DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
The Art of the Pitch: WordPress Relationships and Sales - Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview - Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on countries – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an "infrastructure container Kubernetes guy", how does this fancy AI technology get managed from an infrastructure operations view? Is it possible to apply our lovely cloud native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and provide you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and what could be beneficial or limiting for your AI use cases in an enterprise environment. An interactive demo will give you some insights into which approaches I have already gotten working for real.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality - Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
PHP Frameworks: I want to break free (IPC Berlin 2024) - Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
UiPath Test Automation using UiPath Test Suite series, part 3 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks, has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... - UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
4. … Become Persistent Questions in the Data Lifecycle
What’s in this data?
Can I make use of it?
Unboxing → Transformation → Analysis → Visualization → Productization
7. “It’s easy to just think you know what you are doing and not look at data at every intermediary step.
An analysis has 30 different steps. It’s tempting to just do this then that and then this. You have no idea in which ways you are wrong and what data is wrong.”
8. What’s in the data?
• The Expected: Models, Densities, Constraints
• The Unexpected: Residuals, Outliers, Anomalies
21. Mapping out the Design Space
How much data to examine?
How accurate are the results?
How fast can you get them?
22. Mapping out the Design Space
Decide how your requirements fall on these axes
Find a strategy (if one exists) that fits the requirements
Accuracy
Urgency
Data Volume
24. Strategy vs Cost
Random Sample
Accuracy
Urgency
Data Volume
Good Enough | Anomalies | Big Picture | Unbox
25. Strategy vs Cost
Scan, summarize, collect samples
Accuracy
Urgency
Data Volume
Good Enough | Anomalies | Big Picture | Unbox
26. “Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise.”
(Data Analysis & Statistics, Tukey & Wilk, 1966)
28. Sanity Check: Is this really expensive?
• Computers are fast
• In-memory, column stores, OLAP, …
• Still, “Big Data” can be hard
• Big is sometimes really big
• Big data can be raw: no indexes or precomputed summaries
• Agility remains critical to harness the “informed human mind”
29. Two Useful Techniques
Sampling
• A variety of techniques available
Sketches
• One-pass memory-efficient structures for capturing distributions
Accuracy
Urgency
Data Volume
31. Approaches to Sampling
• Scan-based access
• Head-of-file
• Bernoulli
• Reservoir
• Random I/O Sampling
• Block-level sampling
32. Head-of-File
• Pros:
• Very fast: small data, no disk seeks
• Absolutely required when unboxing raw data
• Nested data (JSON/XML), Text (logs, database dumps, etc.)
• Cons:
• Correlation of position and value
33. Bernoulli
• Take a full pass, flip a (weighted) coin for each record
• Pros:
• trivial to implement
• trivial to parallelize
• almost no memory required
• Cons:
• requires a full scan of the data
• output size proportional to input size, and random
from random import random
sample = list(filter(lambda x: random() < 0.01, data))  # keep each record with probability 1%
34. Reservoir
• Fix a “reservoir” of k items. For each later item, with some probability, eject an old item for the new one
• Pros:
• trivial to implement
• easy to parallelize
• constant memory required
• fixed-size output — need not know input size in advance
• Cons:
• Requires a full scan of the data

from random import random, randint

res = data[0:k]  # initialize: first k items
counter = k      # number of items seen so far
for x in data[k:]:
    # accept the (counter+1)-th item with probability k/(counter+1)
    if random() < k / float(counter + 1):
        res[randint(0, len(res) - 1)] = x  # replace a uniformly random slot
    counter += 1
37. Meta-Strategy: Stratified Sampling
• Sometimes you need representative samples from each “group”
• Coverage: e.g., displaying examples for every state in a map
• Robustness: e.g., consider average income
• if you miss the rare top tax bracket, estimate is way off
38. Stratification: the GroupBy / Agg pattern
• Given:
• A group-partitioning key for stratification
• Sizes for each stratum
• Easy to implement: partition, and construct sample per partition
• your favorite sampling technique applies
SELECT D.group_key, reservoir(D.value)
FROM data D
GROUP BY D.group_key;
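Outside a database, the same pattern takes only a few lines of Python: partition by the stratification key and keep one reservoir per stratum (a generic sketch reusing the reservoir logic above; the names are illustrative).

from random import random, randint
from collections import defaultdict

def stratified_reservoir(records, key, k):
    """Keep a k-item uniform reservoir per stratum in a single pass."""
    reservoirs = defaultdict(list)
    counts = defaultdict(int)
    for rec in records:
        g = key(rec)
        counts[g] += 1
        if len(reservoirs[g]) < k:
            reservoirs[g].append(rec)       # first k items fill the reservoir
        elif random() < k / float(counts[g]):
            reservoirs[g][randint(0, k - 1)] = rec  # evict a random slot
    return reservoirs

# e.g. representative incomes per state:
# samples = stratified_reservoir(rows, key=lambda r: r["state"], k=100)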
40. Record Sampling
• Randomly sample records?
• Let r be the fraction of items sampled and p the number of rows per block
• 20x random I/O penalty => you need to read fewer than 5% of blocks to come out ahead!
• Pretty inefficient: touches a 1 - (1 - r)^p fraction of blocks
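Plugging illustrative numbers into the 1 - (1 - r)^p formula shows how quickly record-level sampling degenerates into a near-full scan:

# Probability that a block of p rows contains at least one sampled row,
# when each row is sampled independently with probability r:
def blocks_touched(r, p):
    return 1 - (1 - r) ** p

print(blocks_touched(0.01, 100))   # ~0.63: a 1% sample touches ~63% of blocks
print(blocks_touched(0.001, 100))  # ~0.10: even a 0.1% sample touches ~10%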
42. Block Sampling
• Randomly sample blocks of records from disk
• Concern: clustering bias.
• Techniques from database literature: assess bias and correct
• Beware: even block sampling needs to be well below 5%.
43. Sampling in Hadoop
• Larger unit of access: HDFS blocks (128MB vs. 64KB)
• HDFS buffering makes forward seeking within block cheaper
• But CPU costs may encourage sampling within the block.
• …and Hadoop makes it easy to sample across nodes
• Each worker only processes one block
• Must find record boundaries
• Tougher when dealing with quote escaping
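One way to sketch the record-boundary problem (illustrative only, not Hadoop's actual input-format code): a reader dropped at an arbitrary byte offset must scan forward to the next record start, and a naive newline scan breaks once newlines may appear inside quoted CSV fields.

def next_record_start(buf, offset):
    """Return the index just past the next newline that falls outside
    double quotes (simplified CSV: assumes the scan starts outside a
    quoted field and ignores escaped quotes)."""
    in_quotes = False
    for i in range(offset, len(buf)):
        c = buf[i:i+1]
        if c == b'"':
            in_quotes = not in_quotes
        elif c == b"\n" and not in_quotes:
            return i + 1
    return len(buf)  # no further record starts in this block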
45. Sketching
• Family of algorithms for estimating contents of a data stream
• Constant-sized memory footprint
• Computed in 1 pass over the data
• Classic Examples
• Bloom filter: existence testing
• HyperLogLog Sketches (FM): distinct values
• CountMin (CM): a surprisingly versatile sketch for frequencies
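As a concrete member of the family, here is a toy Bloom filter for existence testing (a generic sketch; production implementations size m and k from a target false-positive rate).

import hashlib

class BloomFilter:
    def __init__(self, m=1 << 20, k=5):
        self.m, self.k = m, k          # m bits, k hash functions
        self.bits = bytearray(m // 8)
    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m
    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
    def might_contain(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice@example.com")
assert bf.might_contain("alice@example.com")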
62. The Versatile CountMin Sketch
CountMin (and CountMeanMin) answer “point frequency queries”.
Surprisingly, we can use them to answer many more questions:
• densities
• even order statistics (median, quantiles, etc.)
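A minimal CountMin implementation makes the point-query mechanics visible (a generic sketch, not the deck's code): d hash rows of w counters; collisions can only inflate counts, so the row-wise minimum upper-bounds the true frequency.

import hashlib

class CountMin:
    def __init__(self, w=2048, d=5):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
    def _idx(self, row, item):
        h = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.w
    def add(self, item, count=1):
        for r in range(self.d):
            self.table[r][self._idx(r, item)] += count
    def estimate(self, item):
        # Each row overestimates due to collisions; take the minimum.
        return min(self.table[r][self._idx(r, item)] for r in range(self.d))

cm = CountMin()
for word in ["a", "b", "a", "c", "a"]:
    cm.add(word)
assert cm.estimate("a") >= 3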
75. More Statistics
• Count-Range Queries
• Median
• Quantiles: generalization of Median
• Histograms
[illustration: value domain 0-31]
76. More Statistics
• Count-Range Queries
• Median
• Quantiles
• Histograms:
• fixed-width bins: range queries
• fixed-height bins: quantiles
[histogram bins: 1-10, 11-20, 21-30, 31-40]
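A small illustration of the two binning styles on made-up data: fixed-width bins answer count-range queries by lookup, while fixed-height bin boundaries are exactly the quantiles.

import numpy as np

values = np.random.default_rng(0).integers(1, 41, size=10_000)

# Fixed-width bins: each bin covers an equal value range (1-10, 11-20, ...),
# so a count-range query becomes a lookup (plus partial bins at the edges).
counts, edges = np.histogram(values, bins=[1, 11, 21, 31, 41])

# Fixed-height bins: each bin holds an equal number of points, so the
# bin boundaries are exactly the quantiles (here, quartiles).
quartiles = np.quantile(values, [0.25, 0.5, 0.75])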