Invited Talk at Modern Data Management Systems Summit on August 29-30, 2014 at Tsinghua University in Beijing, China.
http://ise.thss.tsinghua.edu.cn/MDMS/English/program.jsp
Abstract:
Modern enterprises increasingly rely on complex analyses over large data sets to drive business decisions. Tasks such as root cause analysis from system logs, lead generation based on social media, customer retention, and digital marketing are rapidly gaining importance. These applications generally consist of three major analytic phases: text analytics, semi-structured data processing (joins, group-by, aggregation), and statistical/predictive modeling. The size of the datasets, in conjunction with the complexity of the analysis, necessitates large-scale distributed processing of the analytical algorithms. At IBM we are building tools and technologies based on declarative languages to support each of these analytic phases. The declarative nature of the languages removes the need for programmers to hand-optimize their code. Furthermore, the syntax of these languages is designed to appeal to the corresponding communities. As an example, for statistical modeling we expose a high-level language with syntax similar to R, a very popular statistical processing language.
In this talk I will give an overview of some real-world big data applications we are currently working on and use that to motivate the need for declarative analytics consisting of the three major phases discussed above. I will then describe, in some detail, declarative systems for text analytics along with a discussion on speeds, feeds and comparisons.
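To make the second phase concrete: the semi-structured processing step (joins, group-by, aggregation) boils down to operations like the following, sketched here in plain Python over a hypothetical log dataset (the records and field names are illustrative, not from any IBM system):

```python
from collections import defaultdict

# Hypothetical records already produced by the text-analytics phase.
events = [
    {"host": "web-1", "status": 500, "latency_ms": 120},
    {"host": "web-1", "status": 200, "latency_ms": 35},
    {"host": "web-2", "status": 500, "latency_ms": 210},
    {"host": "web-2", "status": 500, "latency_ms": 180},
]

# Group-by host, then aggregate: error count and mean latency per host.
groups = defaultdict(list)
for e in events:
    groups[e["host"]].append(e)

summary = {
    host: {
        "errors": sum(1 for e in rows if e["status"] >= 500),
        "mean_latency_ms": sum(e["latency_ms"] for e in rows) / len(rows),
    }
    for host, rows in groups.items()
}
```

A declarative language expresses the same group-by/aggregate as a single statement and leaves the distribution strategy to the optimizer, which is the point of the abstraction described above.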
SystemT: Declarative Information Extraction (Yunyao Li)
Slides used for my talk "SystemT: Declarative Information Extraction" at the event "University of Oregon Big Opportunities with Big Data Meeting" on August 8, 2014 (http://bigdata.uoregon.edu).
Natural language understanding is a fundamental task in artificial intelligence. English understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Discovery. However, scaling existing products and services to support additional languages remains an open challenge. In this talk, we will discuss the open challenges in supporting universal natural language understanding and share our work over the past few years in addressing them. We will also showcase how a universal semantic representation of natural languages can enable cross-lingual information extraction in concrete domains (e.g., compliance), and show ongoing efforts toward seamlessly scaling existing NLP capabilities across languages with minimal effort.
Human in the Loop AI for Building Knowledge Bases (Yunyao Li)
The ability to build large-scale domain-specific knowledge bases that capture and extend the implicit knowledge of human experts is the foundation for many AI systems. We use an ontology-driven approach for the creation, representation, and consumption of such domain-specific knowledge bases. This approach relies on several well-known building blocks: natural language processing, entity resolution, and data transformation and fusion. I will present several human-in-the-loop tools that target domain experts (rather than programmers), extracting domain knowledge from the human expert and mapping it into the "right" models or algorithms. I will also share successful use cases in several domains, including compliance, finance, and healthcare: with these tools we can match the level of accuracy achieved by manual efforts, but at significantly lower cost and with much higher scale and automation.
Enterprise Search in the Big Data Era: Recent Developments and Open Challenges (Yunyao Li)
These are the slides used in our 3-hour tutorial at VLDB 2014.
Yunyao Li, Ziyang Liu, Huaiyu Zhu: Enterprise Search in the Big Data Era: Recent Developments and Open Challenges. PVLDB 7(13): 1717-1718 (2014)
Abstract:
Enterprise search allows users in an enterprise to retrieve desired information through a simple search interface. It is widely viewed as an important productivity tool within an enterprise. While Internet search engines have been highly successful, enterprise search remains notoriously difficult due to a variety of unique challenges, and is being made more so by the increasing heterogeneity and volume of enterprise data. On the other hand, enterprise search also presents opportunities to succeed in ways beyond current Internet search capabilities. This tutorial presents an organized overview of these challenges and opportunities, and reviews the state-of-the-art techniques for building a reliable and high-quality enterprise search engine in the context of the rise of big data.
Deep learning for e-commerce: current status and future prospects (Rakuten Group, Inc.)
Deep learning is the prime avenue for Artificial Intelligence, with spectacular accomplishments in diverse fields such as computer vision, natural language processing, and board games such as Go. Its impact on e-commerce is already significant and will continue to grow in future years. In this talk, we will review some of the successful deep learning algorithms in light of their current and expected impact on e-commerce.
The success of data-driven solutions to difficult problems, along with the dropping costs of storing and processing massive amounts of data, has led to growing interest in large-scale machine learning. This paper presents a case study of Twitter's integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform. We begin with an overview of this platform, which handles "traditional" data warehousing and business intelligence tasks for the organization. The core of this work lies in recent Pig extensions to provide predictive analytics capabilities that incorporate machine learning, focused specifically on supervised classification. In particular, we have identified stochastic gradient descent techniques for online learning and ensemble methods as being highly amenable to scaling out to large amounts of data. In our deployed solution, common machine learning tasks such as data sampling, feature generation, training, and testing can be accomplished directly in Pig, via carefully crafted loaders, storage functions, and user-defined functions. This means that machine learning is just another Pig script, which allows seamless integration with existing infrastructure for data management, scheduling, and monitoring in a production environment, as well as access to rich libraries of user-defined functions and the materialized output of other scripts.
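Stochastic gradient descent suits this setting because each training example is visited once and then discarded, so the learner streams over data that never fits in memory. A minimal pure-Python sketch of that idea on a toy synthetic dataset (the data and learning rate are illustrative assumptions, not Twitter's actual pipeline):

```python
import math
import random

def sgd_logistic(stream, dim, lr=0.1):
    """One-pass online logistic regression via stochastic gradient descent.

    `stream` yields (features, label) pairs with label in {0, 1}; every
    example is seen exactly once, which is what makes the approach
    amenable to scan-oriented platforms like Pig/Hadoop.
    """
    w = [0.0] * dim
    for x, y in stream:
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))       # predicted probability
        for i in range(dim):
            w[i] += lr * (y - p) * x[i]      # gradient step on one example
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

# Toy linearly separable stream: label is 1 when x1 > x0.
random.seed(0)
data = [([a, b], 1 if b > a else 0)
        for a, b in ((random.random(), random.random()) for _ in range(2000))]
w = sgd_logistic(iter(data), dim=2)
acc = sum(predict(w, x) == y for x, y in data) / len(data)
```

In the deployed system the equivalent of this loop would live inside a Pig storage function or UDF, so training is expressed as a Pig script rather than a standalone program.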
The Machine Learning Workflow with Azure (Ivo Andreev)
Machine learning is not black magic but a discipline that involves data analysis, data science and, of course, hard work. From finding patterns in data and applying algorithms to converting the output into usable predictions, you need background knowledge and the appropriate tools. In this session, we will go through the major approaches to preparing data and building and deploying ML models in Azure (ML Studio, Data Science VM, Jupyter Notebook). Most importantly, based on examples from the real world, we will provide you with a workflow of best practices.
BigInsights and Text Analytics.
As enterprises seek to gain operational efficiencies and competitive advantage through greater use of analytics, much of the new information they need to analyze is found in text documents and, increasingly, in a wide variety of social media sites and portals. A critical step in gaining insights from this information is extracting core data from huge volumes of text. That data is then available for downstream analytic, mining and machine learning tools. AQL (Annotator Query Language) is a powerful declarative, rule-based language for the extraction of information from text documents.
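AQL itself has its own declarative syntax; as a rough Python analogy of its rule-based extract-regex pattern, the sketch below applies named rules over a document and returns structured tuples (the rules and document are made up for illustration and are not actual AQL):

```python
import re

# A "rule" pairs an output label with a regular expression, loosely
# mirroring the spirit of AQL's declarative extract-regex statements.
RULES = {
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def extract(doc):
    """Apply every rule to a document, yielding (label, span, text) tuples."""
    for label, pattern in RULES.items():
        for m in pattern.finditer(doc):
            yield (label, m.span(), m.group())

doc = "Contact support at 800-555-0199 or help@example.com."
tuples = list(extract(doc))
```

The structured tuples produced this way are exactly the kind of "core data" the paragraph above describes handing off to downstream analytic, mining, and machine learning tools; a declarative engine additionally optimizes how and in what order the rules run.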
BigMLSchool: ML Platforms and AutoML in the Enterprise (BigML, Inc)
An introductory session on the current situation of Machine Learning, the importance of ML platforms and AutoML, and why ML needs to be properly taught at schools and universities.
The lecturer is Ed Fernández, Board Director at BigML and Arowana International, a Private Equity firm, Faculty at Northeastern University (the Silicon Valley campus), lecturer at Headspring Corporate Learning (the Joint Venture of Financial Times and IE Business School), and mentor at Berkeley Sutardja Center for Entrepreneurship and Technology.
*Machine Learning School for Business Schools 2021: Virtual Conference.
Guiding through a typical Machine Learning Pipeline (Michael Gerke)
Many people are talking about AI and Machine Learning. Here's a quick guide on how to manage ML projects and what to consider when implementing machine learning use cases.
Building multi billion (dollars, users, documents) search engines on open ... (Andrei Lopatenko)
How to use open source technologies to build search engines for billions of users, billions of dollars of revenue, and billions of documents.
Keynote talk at The 16th International Conference on Open Source Systems.
Real-time Recommendations for Retail: Architecture, Algorithms, and Design (Juliet Hougland)
Users are constantly searching for new content, and to stay competitive, organizations must act immediately on up-to-date data. Outdated recommendations decrease the likelihood of presenting the right offer and make it harder to maintain customer loyalty. To provide the most relevant recommendations and increase engagement, organizations must track customer interactions and re-score recommendations on the fly.
Data sources have expanded dramatically to include a wealth of historical data and a constant influx of behavior data. The key to moving from predictive models applied in batch to models that provide responses in real time is to focus on the efficiency of model application. The speed at which recommendations can be served is influenced by:
Architecture of the recommendation serving platform
Choice of recommendation algorithm
Datastore access patterns
In this presentation, we’ll discuss how developers can use open source components like HBase and Kiji to develop low-latency recommendation models that can be easily deployed by e-commerce companies. We will give practical advice on how to choose models and design data stores that make use of the architecture and quickly serve new recommendations.
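The low-latency pattern behind this, heavy model computation offline and cheap lookups at serving time, can be sketched as follows. The similarity table and item names are hypothetical, and this is plain Python rather than the HBase/Kiji APIs:

```python
# Precomputed offline (e.g., in a batch job): item -> {similar_item: score}.
# At serving time we only do dictionary lookups and additions, which is
# what keeps per-request latency low.
ITEM_SIMS = {
    "tent":  {"stove": 0.8, "lantern": 0.6},
    "stove": {"tent": 0.8, "fuel": 0.9},
}

def recommend(recent_items, k=2):
    """Re-score candidates from the user's most recent interactions."""
    scores = {}
    for item in recent_items:
        for cand, s in ITEM_SIMS.get(item, {}).items():
            if cand not in recent_items:          # don't re-recommend
                scores[cand] = scores.get(cand, 0.0) + s
    return sorted(scores, key=scores.get, reverse=True)[:k]

recs = recommend(["tent", "stove"])
```

In a production deployment the `ITEM_SIMS` table would live in a low-latency datastore such as HBase, which is why the datastore access pattern appears in the list of speed factors above.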
Scaling up business value with real-time operational graph analytics (Connected Data World)
Graph-based solutions have been in the market for over a decade, with deployments in financial services, healthcare, retail, and manufacturing. But the graph technology of the past confined these solutions to simple queries (1 or 2 hops) and modest data sizes, or suffered from slow response times, all of which limited their value.
A new generation of fast, scalable graph databases, led by TigerGraph, is opening up a new world of business insight and performance. Join us as we explore exciting new use cases powered by a native parallel graph database with storage and computation capabilities at each node:
A large financial services payment provider is using graph-based pattern detection (7 to 11 hop queries) to detect more fraud and money laundering in real time, handling peak volume of 256,000 transactions per second.
IceKredit, an innovative FinTech, is transforming the near-prime and sub-prime credit markets in the United States, China, and South Asian countries with customer 360 analytics for credit approval and ongoing monitoring.
A biotech and pharmaceutical giant is building a prescriber and patient 360 graph and using multi-hop exploratory and analytic queries to understand the most efficient ways of launching a new drug for maximum return.
Wish.com is delivering real-time personalized recommendations to increase eCommerce revenue.
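A multi-hop query of the kind described above (for example, the 7-to-11-hop fraud patterns) is, at its core, a bounded traversal. A minimal sketch over a toy in-memory payment graph (the data is invented for illustration; a real graph database distributes the traversal across storage nodes):

```python
from collections import deque

# Toy payment graph: account -> accounts it sent money to.
EDGES = {
    "a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"],
    "e": ["mule"], "x": ["y"],
}

def within_hops(graph, start, max_hops):
    """Breadth-first search: every node reachable in at most max_hops hops,
    mapped to its hop distance from `start`."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue                      # don't expand past the hop budget
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen

reach = within_hops(EDGES, "a", 5)
```

A fraud rule would then test whether a suspicious account (here, "mule") falls inside the hop budget of the account under review; the hop limit bounds the work, which is what makes such queries feasible in real time.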
Machine Learning with Big Data using Apache Spark (InSemble)
"Machine Learning with Big Data using Apache Spark" was presented to the Lansing Big Data and Hadoop User Group by Muk Agaram and Amit Singh on 3/31/2015. It covers the basics of machine learning and demos a use case of predicting recessions using Apache Spark with Logistic Regression, SVM, and Random Forest algorithms.
Recent Gartner and Capgemini studies estimate that only around 25% of data science projects are successful and only around 15% make it to full-scale production. Of these, many degrade in performance and produce disappointing results within months of implementation. How can focusing on the desired business outcomes and business use cases throughout a data science project help overcome the odds?
Introduction to Machine Learning and Data Science using the Autonomous databa... (Sandesh Rao)
This session focuses on the basics of Machine Learning: the different types of machine learning and neural networks, supervised and unsupervised learning, and AutoML for training models. It ends with examples of how to predict workloads using Average Active Sessions and various algorithms, and how to predict maintenance windows for your databases. We will also draw many examples from the ADW Oracle Autonomous Database offering and the Oracle Machine Learning library, making this a session with plenty of code examples in addition to the theory of Machine Learning, and you will walk out with a definitive path to becoming a data scientist.
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data... (Simplilearn)
This presentation about "Data Science Engineer Career, Salary, and Resume" will help you understand who a Data Science Engineer is, the salary of a Data Science Engineer, the Data Science Engineer skill set, and the Data Science Engineer resume. Data science is a systematic way to analyze massive amounts of data and extract information from them. Data Science can answer a lot of questions as well, and is mainly required for better decision making, predictive analysis, and pattern recognition.
Below are topics that we will be discussing in this presentation:
1. Introduction to Data Science
2. Who is a Data Science Engineer
3. Data Science Engineer Skillset
4. Data Science Engineer job roles
5. Data Science Engineer salary trends
6. Data Science Engineer Resume
Why learn Data Science?
Data Scientists are being deployed in all kinds of industries, creating a huge demand for skilled professionals. The data scientist is the pinnacle rank in an analytics organization. Glassdoor ranked data scientist first in its 25 Best Jobs for 2016, and good data scientists are scarce and in great demand. As a data scientist, you will be required to understand the business problem, design the analysis, collect and format the required data, apply algorithms or techniques using the correct tools, and finally make recommendations backed by data.
You can gain in-depth knowledge of Data Science by taking our Data Science with Python certification training course. With Simplilearn’s Data Science certification training course, you will prepare for a career as a Data Scientist as you master all the concepts and techniques. Those who complete the course will be able to:
1. Gain an in-depth understanding of data science processes: data wrangling, data exploration, data visualization, hypothesis building, and testing. You will also learn the basics of statistics.
2. Install the required Python environment and other auxiliary tools and libraries
3. Understand the essential concepts of Python programming, such as data types, tuples, lists, dicts, basic operators, and functions
4. Perform high-level mathematical computing using the NumPy package and its large library of mathematical functions
5. Perform scientific and technical computing using the SciPy package and its sub-packages, such as Integrate, Optimize, Statistics, IO, and Weave
6. Perform data analysis and manipulation using the data structures and tools provided in the Pandas package
7. Gain expertise in machine learning using the Scikit-Learn package
Data Science with Python is recommended for:
1. Analytics professionals who want to work with Python
2. Software professionals looking to get into the field of analytics
3. IT professionals interested in pursuing a career in analytics
4. Graduates looking to build a career in analytics and data science
Learn more at https://www.simplilearn.com/big-data-and-analytics/python-for-data-science-training
Five Critical Success Factors for Big Data and Traditional BI (Inside Analysis)
The Briefing Room with Dr. Robin Bloor and VelociData
Live Webcast Dec. 10, 2013
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?AT=pb&SP=EC&rID=7909837&rKey=b0bac7d09bf1a638
Most Big Data discussions focus on analytics, but business users need more than that. They need speed, because most opportunities these days are transient and must be acted on quickly. Bottlenecks in the delivery of analytic results often occur on the gathering and transformation side, where massive volumes of data must be validated, converted, masked or otherwise transformed before hitting the analytics engine. Big Data is rapidly overrunning conventional approaches, creating requirements for accelerated, hybrid systems.
Register for this episode of the Briefing Room to hear veteran IT Analyst Dr. Robin Bloor, as he explains how a combination of innovations is dramatically changing how companies can solve serious data transformation challenges. Robin will be briefed by Ron Indeck of VelociData, who will tout their record-breaking data operations appliance. He'll also discuss five critical success factors for achieving optimal performance, including the necessary infrastructure for executing data transformations at wire speed.
Visit InsideAnalysis.com for more information
The Data Lake: Empowering Your Data Science Team (Senturus)
Data science overview: its definition, purpose, relation to BI, differences from BI, and the benefits of using both data science and BI. View the webinar video recording and download this deck: http://www.senturus.com/resources/data-lake-empowering-data-science-team/.
Learn how the data lake can empower data science teams and free up valuable data warehouse resources.
Senturus, a business analytics consulting firm, has a resource library with hundreds of free recorded webinars, trainings, demos and unbiased product reviews. Take a look and share them with your colleagues and friends: http://www.senturus.com/resources/.
The Machine Learning Workflow with AzureIvo Andreev
Machine learning is not black magic but a discipline that involves data analysis, data science and of course – hard work. From searching patterns in data, applying algorithms to converting to usable predictions, you would need background and appropriate tools. In this session, we will go through major approaches to prepare data, build and deploy ML models in Azure (ML Studio, DataScience VM, Jupyter Notebook). Most importantly – based on some examples from the real world, we will provide you with a workflow of best practices.
BigInsights and Text Analytics.
As enterprises seek to gain operational efficiencies and competitive advantage through greater use of analytics, much of the new information they need to analyze is found in text documents and, increasingly, in a wide variety of social media sites and portals. A critical step in gaining insights from this information is extracting core data from huge volumes of text. That data is then available for downstream analytic, mining and machine learning tools. AQL (Annotator Query Language) is a powerful declarative, rule-based language for the extraction of information from text documents.
BigMLSchool: ML Platforms and AutoML in the EnterpriseBigML, Inc
An introductory session on the current situation of Machine Learning, the importance of ML platforms and AutoML, and why ML needs to be properly taught at schools and universities.
The lecturer is Ed Fernández, Board Director at BigML and Arowana International, a Private Equity firm, Faculty at Northeastern University (the Silicon Valley campus), lecturer at Headspring Corporate Learning (the Joint Venture of Financial Times and IE Business School), and mentor at Berkeley Sutardja Center for Entrepreneurship and Technology.
*Machine Learning School for Business Schools 2021: Virtual Conference.
Guiding through a typical Machine Learning PipelineMichael Gerke
Many People are talking about AI and Machine Learning. Here's a quick guideline how to manage ML Projects and what to consider in order to implement machine learning use cases.
Building multi billion ( dollars, users, documents ) search engines on open ...Andrei Lopatenko
How to use open source technologies to build search engines for billions of users, billions of revenue, billions of documents
Keynote talk at The 16th International Conference on Open Source Systems.
Real-time Recommendations for Retail: Architecture, Algorithms, and DesignJuliet Hougland
Users are constantly searching for new content and to stay competitive organizations must act immediately based on up-to-date data. Outdated recommendations decrease the likelihood of presenting the right offer and make it harder to maintain customer loyalty. In order to provide the most relevant recommendations and increase engagement, organizations must track customer interactions and re-score recommendations on the fly.
Data sources have expanded dramatically to include a wealth of historical data and a constant influx of behavior data. The key to moving from predictive models, applied in batch, to models that provide responses in real time, is to focus on the efficiency of model application. The speed that recommendations can be served is influenced by:
Architecture of the recommendation serving platform
Choice of recommendation algorithm
Datastore access patterns
In this presentation, we’ll discuss how developers can use open source components like HBase and Kiji to develop low-latency recommendation models that can be easily deployed by e-commerce companies. We will give practical advice on how to choose models and design data stores that make use of the architecture and quickly serve new recommendations.
Scaling up business value with real-time operational graph analyticsConnected Data World
Graph-based solutions have been in the market for over a decade with deployments in financial services, healthcare, retail, and manufacturing. The graph technology of the past limited them to simple queries (1 or 2 hops), modest data sizes, or slow response times, which limited their value.
A new generation of fast, scalable graph databases, led by TigerGraph, is opening up a new world of business insight and performance. Join us, as we explore some new exciting use cases powered by native parallel graph database with storage and computation capability for each node:
A large financial services payment provider is using graph-based pattern detection (7 to 11 hop queries) to detect more fraud and money laundering in real time, handling peak volume of 256,000 transactions per second.
IceKredit, an innovative FinTech is transforming the near-prime and sub-prime credit market in United States, China and South Asian countries with customer 360 analytics for credit approval and ongoing monitoring.
A biotech and pharmaceutical giant is building a prescriber and patient 360 graph and using multi-hop exploratory and analytic queries to understand the most efficient ways of launching a new drug for maximum return.
Wish.com is delivering real-time personalized recommendations to increase eCommerce revenue.
Machine Learning with Big Data using Apache SparkInSemble
"Machine Learning with Big Data
using Apache Spark" was presented to Lansing Big Data and Hadoop User Group by Muk Agaram and Amit Singh on 3/31/2015. It goes over the basics of machine learning and demos a use case of predicting recession using Apache Spark through Logistic Regression, SVM and Random Forest Algorithm
Recent Gartner and Capgemini studies predict only around 25% of data science projects are successful and only around 15% make it to full-scale production. Of these, many degrade in performance and produce disappointing results within months of implementation. How can focusing on the desired business outcomes and business use cases throughout a data science project help overcome the odds?
Introduction to Machine Learning and Data Science using the Autonomous databa...Sandesh Rao
This session will focus on basics of what Machine Learning is, different types of Machine Learning and Neural Networks, supervised and unsupervised machine learning, AutoML for training models and this ends with an example of how to predict workloads using Average Active sessions and different algorithms as an example and also how to predict maintenance windows for your databases. We will also use many examples from the ADW Oracle Autonomous Database offering, Oracle Machine Learning library to make this a session with lots of code examples in addition to the theory of Machine Learning and you will walk out having a definitive path to being a data scientist
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...Simplilearn
This presentation about "Data Science Engineer Career, Salary, and Resume" will help you understand who is a Data Science Engineer, the salary of a Data Science Engineer, Data Science Engineer Skillset and Data Science Engineer Resume. Data science is a systematic way to analyze a massive amount of data and extract information from them. Data Science can answer a lot of questions, as well. Data Science is mainly required for
better decision making, predictive analysis, and pattern recognition.
Below are topics that we will be discussing in this presentation:
1. Introduction to Data Science
2. Who is a Data Science Engineer
3. Data Science Engineer Skillset
4. Data Science Engineer job roles
5. Data Science Engineer salary trends
6. Data Science Engineer Resume
Why learn Data Science?
Data Scientists are being deployed in all kinds of industries, creating a huge demand for skilled professionals. The data scientist is the pinnacle rank in an analytics organization. Glassdoor has ranked data scientist first in the 25 Best Jobs for 2016, and good data scientists are scarce and in great demand. As a data, you will be required to understand the business problem, design the analysis, collect and format the required data, apply algorithms or techniques using the correct tools, and finally make recommendations backed by data.
You can gain in-depth knowledge of Data Science by taking our Data Science with python certification training course. With Simplilearn’s Data Science certification training course, you will prepare for a career as a Data Scientist as you master all the concepts and techniques. Those who complete the course will be able to:
1. Gain an in-depth understanding of data science processes, data wrangling, data exploration, data visualization, hypothesis building and testing, and the basics of statistics
2. Install the required Python environment and other auxiliary tools and libraries
3. Understand the essential concepts of Python programming such as data types, tuples, lists, dicts, basic operators and functions
4. Perform high-level mathematical computing using the NumPy package and its large library of mathematical functions
5. Perform scientific and technical computing using the SciPy package and its sub-packages such as Integrate, Optimize, Statistics, IO, and Weave
6. Perform data analysis and manipulation using data structures and tools provided in the Pandas package
7. Gain expertise in machine learning using the Scikit-Learn package
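As a small taste of the NumPy portion of the syllabus above, vectorized array math replaces explicit Python loops. The sample temperatures here are invented for illustration:

```python
# Tiny illustration of the NumPy skills listed above: elementwise math on
# whole arrays instead of element-by-element Python loops.
# (Sample Celsius temperatures are invented for illustration.)
import numpy as np

temps_c = np.array([12.0, 18.5, 21.0, 26.5])
temps_f = temps_c * 9 / 5 + 32   # elementwise conversion, no explicit loop
print(temps_f.mean())            # mean temperature in Fahrenheit
```

Pandas and Scikit-Learn build on the same array foundations, adding DataFrame manipulation and a uniform fit/predict estimator interface, respectively.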
Data Science with Python is recommended for:
1. Analytics professionals who want to work with Python
2. Software professionals looking to get into the field of analytics
3. IT professionals interested in pursuing a career in analytics
4. Graduates looking to build a career in analytics and data science
Learn more at https://www.simplilearn.com/big-data-and-analytics/python-for-data-science-training
Five Critical Success Factors for Big Data and Traditional BI - Inside Analysis
The Briefing Room with Dr. Robin Bloor and VelociData
Live Webcast Dec. 10, 2013
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?AT=pb&SP=EC&rID=7909837&rKey=b0bac7d09bf1a638
Most Big Data discussions focus on analytics, but business users need more than that. They need speed, because most opportunities these days are transient and must be acted on quickly. Bottlenecks in the delivery of analytic results often occur on the gathering and transformation side, where massive volumes of data must be validated, converted, masked or otherwise transformed before hitting the analytics engine. Big Data is rapidly overrunning conventional approaches, creating requirements for accelerated, hybrid systems.
Register for this episode of the Briefing Room to hear veteran IT Analyst Dr. Robin Bloor, as he explains how a combination of innovations is dramatically changing how companies can solve serious data transformation challenges. Robin will be briefed by Ron Indeck of VelociData, who will tout their record-breaking data operations appliance. He'll also discuss five critical success factors for achieving optimal performance, including the necessary infrastructure for executing data transformations at wire speed.
Visit InsideAnalysis.com for more information
The Data Lake: Empowering Your Data Science Team - Senturus
Data science overview: defined, purpose, relation to BI, differences from BI and benefits from using both data science and BI. View the webinar video recording and download this deck: http://www.senturus.com/resources/data-lake-empowering-data-science-team/.
Learn how the data lake can empower data science teams and free up valuable data warehouse resources.
Senturus, a business analytics consulting firm, has a resource library with hundreds of free recorded webinars, trainings, demos and unbiased product reviews. Take a look and share them with your colleagues and friends: http://www.senturus.com/resources/.
Take Action: The New Reality of Data-Driven Business - Inside Analysis
The Briefing Room with Dr. Robin Bloor and WebAction
Live Webcast on July 23, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=360d371d3a49ad256942f55350aa0a8b
The waiting used to be the hardest part, but not anymore. Today’s cutting-edge enterprises can seize opportunities faster than ever, thanks to an array of technologies that enable real-time responsiveness across the spectrum of business processes. Early adopters are solving critical business challenges by enabling the rapid-fire design, development and production of very specific applications. Functionality can range from improved customer engagement to dynamic machine-to-machine interactions.
Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Robin Bloor, who will tout a new era in data-driven organizations, and why a data flow architecture will soon be critical for industry leaders. He’ll be briefed by Sami Akbay of WebAction, who will showcase his company’s real-time data management platform, which combines all the component parts needed to access, process and leverage data big and small. He’ll explain how this new approach can provide game-changing power to organizations of all types and sizes.
Visit InsideAnalysis.com for more information.
(BDT207) Use Streaming Analytics to Exploit Perishable Insights | AWS re:Inve... - Amazon Web Services
Streaming analytics is about knowing and acting on what's happening in your business and with your customers right this second. Forrester calls these perishable insights because they occur at a moment's notice and you must act on them fast. The high-velocity, whitewater flow of data from innumerable real-time data sources such as market data, internet of things, mobile, sensors, clickstream, and even transactions remains largely un-navigated by most firms. The opportunity to leverage streaming analytics has never been greater. In this session, Forrester analyst Mike Gualtieri explains the opportunity, use cases, and how to use cloud-based streaming solutions in your application architecture.
Barcelona Digital Festival 28th Nov 2019 - Data Analytics in eSports. Ubeat Ca... - CIO Edge
Taken from our BCN Digital Festival last week, for info on attending, speaking or sponsoring our next event on the 29/30th April 2020 email enquiry@digitalenterprisefest.com
Data Analytics in eSports: Ubeat Case Study
Building AI and automation services needs a solid base of data management, but the current environment is volatile, uncertain, complex and ambiguous, so you never know which data will be important in the coming months.
Data management is even more challenging in an extremely dynamic market like eSports, where everything is still being created, reinvented and validated.
UBEAT is the leading streaming platform for eSports-related content. Created in November 2018, it is not yet 12 months old, but it already has many learnings in its rear-view mirror and a lot of future ahead, especially regarding data management.
To apply to speak at or sponsor our 2020 events, go to www.digitalenterprisefest.com
This webinar featuring Claudia Imhoff, President of Intelligent Solutions & Founder of the Boulder BI Brain Trust (BBBT), Matt Schumpert, Director of Product Management and Azita Martin, CMO at Datameer, will highlight the latest technology trends in extending BI with big data analytics and the top high impact use cases.
Attendees will hear about:
-- The extended architecture for today's modern analytics environment
-- The Internet of Things (IoT) and big data
-- The evolution of analytics – from descriptive to prescriptive
-- High impact use cases as a result of the changing analytics world
Presumption of Abundance: Architecting the Future of Success - Inside Analysis
Hot Technologies with Dr. Claudia Imhoff, Dr. Robin Bloor and SAS
Live Webcast on Jan. 14, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=9431631f43a8c7561f2ba996750a4612
When resources are scarce, organizations focus heavily on keeping processes intact and costs down. The result is often a cycle of decisions that hinders development and ultimately leads to zero innovation. But these days, the market is teeming with game-changing solutions with more attractive price points, paving the way toward a new mindset and an era of abundance.
Register for this episode of Hot Technologies to learn from veteran Analysts Claudia Imhoff and Robin Bloor as they discuss how the proliferation of data and analytics is forcing the enterprise to rethink and redesign its architecture. They’ll be briefed by Gary Spakes of SAS, who will explain his company’s approach to Big Data analytics. He will show how disruptive technologies like Hadoop can give organizations the scalability and reliability they need, and at the same time boost data discovery, analytic innovation and time-to-value.
Visit InsideAnalysis.com for more information.
Lecture to the London S2DS students.
Some fun in highlighting that I'm their polar opposite (no schooling since 17, and focused on operations not science).
Time Difference: How Tomorrow's Companies Will Outpace Today's - Inside Analysis
The Briefing Room with Mark Madsen and WebAction
Live Webcast Feb. 10, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=fa83c6283de99dfb6f38b9d7199cb452
In our increasingly interconnected world, the windows of opportunity for meaningful action are shrinking. Where hours once sufficed, minutes are now the norm. For some transactions, seconds make all the difference, even sub-seconds. Meeting these demands requires a new approach to information architecture, one that embraces the many innovations that are fundamentally changing the data-driven economy.
Register for this episode of The Briefing Room to hear veteran Analyst Mark Madsen of Third Nature as he explains how a confluence of advances is changing the nature of data management. He'll be briefed by Sami Akbay of WebAction, who will showcase his company's real-time data platform, designed from the ground up to meet the challenges of leveraging Big Data in concert with all manner of operational enterprise systems.
Visit InsideAnalysis.com for more information.
IW14 Session: Mike Gualtieri, Forrester Research - Software AG
Session: Apama & Terracotta World; Big Data Streaming Analytics - Right Here, Right Now
Presentation Title: Streaming Analytics Is Icing On The Big Data Cake
Presentation given by Mike Gualtieri, Principal Analyst at Forrester Research, during the Apama & Terracotta World Session at Innovation World 2014 conference, Oct 13-15, 2014, at the Hyatt Regency New Orleans, produced by Software AG. Three days of vision, inspiration and insight. Innovation World is THE global event for digital leaders who are driven to leverage the Software AG Suite: Alfabet, Apama, ARIS, webMethods, Software AG Live, Terracotta and Adabas-Natural.
Big Data Tools PowerPoint Presentation Slides - SlideTeam
Enhance your audience's knowledge with this well-researched complete deck. Showcase all the important features of the deck with perfect visuals. This deck comprises a total of twenty slides, each explained in detail. Each template comprises professional diagrams and layouts. Our professional PowerPoint experts have also included icons, graphs and charts for your convenience. All you have to do is DOWNLOAD the deck and make changes as per your requirements. Yes, these PPT slides are completely customizable: edit the colour, text and font size, and add or delete content from the slides. Leave your audience awestruck with the professionally designed Big Data Tools PowerPoint Presentation Slides complete deck. http://bit.ly/39AwSro
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real... - Kai Wähner
"Big Data" is currently a big hype. Large amounts of historical data are stored in Hadoop or other platforms. Business intelligence tools and statistical computing are used to draw new knowledge and to find patterns in this data, for example for promotions, cross-selling or fraud detection. The key challenge is how these findings from historical data can be integrated into new transactions in real time to make customers happy, increase revenue or prevent fraud.
"Fast Data" via stream processing is the solution to embed patterns - which were obtained from analyzing historical data - into future transactions in real-time. This session uses several real world success stories to explain the concepts behind stream processing and its relation to Hadoop and other big data platforms. The session discusses how patterns and statistical models of R, Spark MLlib and other technologies can be integrated into real-time processing using open source frameworks (such as Apache Storm, Spark or Flink) or products (such as IBM InfoSphere Streams or TIBCO StreamBase). A live demo shows the complete development lifecycle combining analytics, machine learning and stream processing.
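A minimal sketch of the "fast data" idea described above, with an invented threshold rule standing in for a model mined from historical data. Real deployments would use an engine such as Storm, Flink, InfoSphere Streams or StreamBase rather than a plain loop:

```python
# Toy stream scorer: a rule learned offline from historical data
# (here just a hypothetical amount threshold) is applied to each
# incoming transaction in real time.

FRAUD_THRESHOLD = 900.0   # assumed pattern mined from historical data

def score(txn):
    """Flag a single transaction as it arrives on the stream."""
    return "FRAUD?" if txn["amount"] > FRAUD_THRESHOLD else "OK"

stream = [
    {"id": 1, "amount": 120.0},
    {"id": 2, "amount": 980.0},
    {"id": 3, "amount": 45.5},
]

results = [score(txn) for txn in stream]   # a real engine scores event by event
print(results)                             # prints ['OK', 'FRAUD?', 'OK']
```

In practice the "rule" would be a statistical model exported from R or Spark MLlib (e.g. via PMML) and evaluated inside the streaming engine, but the control flow is the same: score each event the moment it arrives.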
Streaming analytics webinar | 9.13.16 | Guest: Mike Gualtieri from Forrester - Cubic Corporation
Business success relies heavily on taking the right action, at the right time, all the time. And actions are dictated by data. But the batch-oriented, collect-store-contemplate model employed by Big Data analytics technologies is incomplete because it does not make use of live data in real time. Without live, real-time data, the insights gathered are not up to date and cannot accurately inform applications and services that would benefit from continuous, real-time context for time-sensitive decisions.
To thrive, businesses need to be able to use both live and historical data in their applications and services, continuously, concurrently, and correctly, and the only technology currently capable of handling this is streaming analytics. Streaming analytics computes on data right now, when it can be analyzed and put to good use to make applications of all kinds contextual and smarter.
This webinar, held in collaboration with Forrester, Inc., showcased how streaming analytics applications can be built in minutes to:
- Aggregate, enrich, and analyze a high throughput of data from multiple, disparate live data sources and in any format to identify patterns, detect opportunities, automate actions, and dynamically adapt
- Easily ingest streaming data from multiple disparate sources to multiple destinations, within and between cloud and on-premises environments
- Analyze and act on data as it arrives, without needing to store it, eliminating unnecessary security risks and storage costs
- Enable real-time analytics with existing business intelligence and data assets.
The Right Data Warehouse: Automation Now, Business Value Thereafter - Inside Analysis
The Briefing Room with Dr. Robin Bloor and WhereScape
Live Webcast on April 1, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=7b23b14b532bd7be60a70f6bd5209f03
In the Big Data shuffle, everyone is looking at Hadoop as “the answer” to collect interesting data from a new set of sources. While Hadoop has given organizations the power to gather more information assets than ever before, the question still looms: which data, regardless of source, structure, volume and all the rest, are significant for affecting business value – and how do we harness it? One effective approach is to bolster the data warehouse environment with a solution capable of integrating all the data sources, including Hadoop, and automating delivery of key information into the right hands.
Register for this episode of The Briefing Room to hear veteran Analyst Robin Bloor as he explains how a rapidly changing information landscape impacts data management. He will be briefed by Mark Budzinski of WhereScape, who will tout his company’s data warehouse automation solutions. Budzinski will discuss how automation can be the cornerstone for closing the gap between those responsible for data management and the people driving business decisions.
Visit InsideAnalysis.com for more information.
Agile Data Science is a lean methodology that is adopted from Agile Software Development. At the core it centers around people, interactions, and building minimally viable products to ship fast and often to solicit customer feedback. In this presentation, I describe how this work was done in the past with examples. Get started today with our help by visiting http://www.alpinenow.com
FRN combines high-quality, authoritative anti-fraud and audit content from the leading providers, AuditNet® LLC and White-Collar Crime 101 LLC/FraudAware.
The two entities designed FRN as the “go-to”, easy-to-use source of “how-to” fraud prevention, detection, audit and investigation templates, guidelines, policies, training programs (recorded without CPE and live with CPE) and articles from leading subject matter experts.
FRN is a continuously expanding and improving resource, offering auditors, fraud examiners, controllers, investigators and accountants a content-rich source of cutting-edge anti-fraud tools and techniques they will want to refer to again and again.
White-Collar Crime Fighter Newsletter Subscribe Now at No Cost!
FraudResourceNet has made the premier anti-fraud newsletter, White-Collar Crime Fighter, freely available to all. All that is required is to complete the registration form with your work email address!
The widely read newsletter, White-Collar Crime Fighter brings you expert strategies and actionable advice from the most prominent experts in the fraud-fighting business. Every two months you'll learn about the latest frauds, scams and schemes... and the newest and most effective fraud-fighting tools, techniques and technologies to put to work immediately to protect your organization.
When it comes to fraud, knowledge of the countless schemes, how they work and red flags to look for will help keep you, your organization and your clients safe.
At FraudResourceNet we understand this and take great pride in providing our FREE White Collar Crime Fighter newsletter -- filled with exclusive articles and tips to provide the knowledge you need.
Make sure you stay informed. Sign up for White Collar Crime Fighter newsletter and we’ll keep you up-to-date on special promos, training opportunities, and other news and offers from FraudResourceNet!
Signing up is easy and FREE. If you have not already subscribed to our newsletter, please sign up to get started!
Sign up for the White Collar Crime Fighter Newsletter (a $99 value ... now completely FREE)
Predicting Medical Test Results using Driverless AI - Sri Ambati
This talk was given at H2O World 2018 NYC and can be viewed here: https://youtu.be/n9g9GxIJoT4
Description:
The goal of the research was to develop an approach to predict individual medical test results based on longitudinal medical and pharma claims data, without direct lab measures, using data-driven techniques. Such discoveries may result in improved treatment strategies. In the presentation we demonstrate how Driverless AI was used both for estimating a highly accurate model and for explaining the results.
Speaker's Bio:
Alexander is the Data Science leader at poder.IO. He is responsible for data flow architecture and insight mining, all powered by machine learning. Before joining poder.IO, Alexander made an academic career at Belarusian State University and Minsk Innovation University where he was Head of the Informatics and Mathematics Department.
Similar to The Power of Declarative Analytics
The Role of Patterns in the Era of Large Language Models - Yunyao Li
Slides for my keynote at the PAN-DL Workshop (Pattern-based Approaches to NLP in the Age of Deep Learning) at EMNLP'2023 (December 6, 2023).
In this talk, I share our initial learnings from constructing, growing and serving large knowledge graphs
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-Loop - Yunyao Li
Keynote talk at HILDA'2023 at SIGMOD on June 18, 2023.
Abstract: The ability to build large-scale knowledge bases that capture and extend the implicit knowledge of human experts is the foundation for many AI systems. We use an ontology-driven approach for the building, growing and serving of such knowledge bases. This approach relies on several well-known building blocks: document conversion, natural language processing, entity resolution, data transformation and fusion. In this talk, I will discuss a wide range of real-world challenges related to building these blocks and present our work to address these challenges via better human-machine cooperation.
Meaning Representations for Natural Languages: Design, Models and Applications - Yunyao Li
EMNLP'2022 Tutorial "Meaning Representations for Natural Languages: Design, Models and Applications"
Instructors: Jeffrey Flanigan, Ishan Jindal, Yunyao Li, Tim O’Gorman, Martha Palmer
Abstract:
We propose a cutting-edge tutorial that reviews the design of common meaning representations, SoTA models for predicting meaning representations, and the applications of meaning representations in a wide range of downstream NLP tasks and real-world applications. Presented by a diverse team of NLP researchers from academia and industry with extensive experience in designing, building and using meaning representations, our tutorial has three components: (1) an introduction to common meaning representations, including basic concepts and design challenges; (2) a review of SoTA methods for building models of meaning representations; and (3) an overview of applications of meaning representations in downstream NLP tasks and real-world applications. We will also present qualitative comparisons of common meaning representations and a quantitative study on how their differences impact model performance. Finally, we will share best practices in choosing the right meaning representation for downstream tasks.
Invited talk at Document Intelligence workshop at KDD'2021.
Harvesting information from complex documents such as financial reports and scientific publications is critical to building AI applications for business and research. Such documents are often in PDF format, with critical facts and data conveyed in tables and graphs, and extracting this information is essential to deriving insights from these documents. In IBM Research, we have a rich agenda in this area that we call Deep Document Understanding. In this talk, I will focus on our research on Deep Table Understanding: extracting and understanding tables from PDF documents. I will introduce key challenges in table extraction and understanding and how we address such challenges, from how to acquire data at scale to enable deep neural network models to how to build, customize and evaluate such models. I will also describe how our work enables real-world use cases in domains such as finance and life science. Finally, I will briefly present TableQA, an important downstream task enabled by Deep Table Understanding.
Explainability for Natural Language Processing - Yunyao Li
Final deck for our popular tutorial on "Explainability for Natural Language Processing" at KDD'2021. See links below for downloadable version (with higher resolution) and recording of the live tutorial.
Title: Explainability for Natural Language Processing
Presenters: Marina Danilevsky, Shipi Dhanorkar, Yunyao Li, Lucian Popa, Kun Qian and Anbang Xu
Website: http://xainlp.github.io/
Recording: https://www.youtube.com/watch?v=PvKOSYGclPk&t=2s
Downloadable version with higher resolution: https://drive.google.com/file/d/1_gt_cS9nP9rcZOn4dcmxc2CErxrHW9CU/view?usp=sharing
@article{kdd2021xaitutorial,
title={Explainability for Natural Language Processing},
author={Danilevsky, Marina and Dhanorkar, Shipi and Li, Yunyao and Popa, Lucian and Qian, Kun and Xu, Anbang},
journal={KDD},
year={2021}
}
Abstract:
This lecture-style tutorial, which mixes in an interactive literature browsing component, is intended for the many researchers and practitioners working with text data and on applications of natural language processing (NLP) in data science and knowledge discovery. The focus of the tutorial is on the issues of transparency and interpretability as they relate to building models for text and their applications to knowledge discovery. As black-box models have gained popularity for a broad range of tasks in recent years, both the research and industry communities have begun developing new techniques to render them more transparent and interpretable. Reporting from an interdisciplinary team of social science, human-computer interaction (HCI), and NLP/knowledge management researchers, our tutorial has two components: an introduction to explainable AI (XAI) in the NLP domain and a review of the state-of-the-art research; and findings from a qualitative interview study of individuals working on real-world NLP projects as they are applied to various knowledge extraction and discovery tasks at a large, multinational technology and consulting corporation. The first component will introduce core concepts related to explainability in NLP. Then, we will discuss explainability for NLP tasks and report on a systematic literature review of the state-of-the-art literature in AI, NLP and HCI conferences. The second component reports on our qualitative interview study, which identifies practical challenges and concerns that arise in real-world development projects that require the modeling and understanding of text data.
Explainability for Natural Language Processing - Yunyao Li
NOTE: Please check out the final version here with small but important updates and links to downloadable version and recording: https://www.slideshare.net/YunyaoLi/explainability-for-natural-language-processing-249992241
Updated version of our popular tutorial "Explainability for Natural Language Processing", given at KDD'2021.
Title: Explainability for Natural Language Processing
@article{kdd2021xaitutorial,
title={Explainability for Natural Language Processing},
author={Danilevsky, Marina and Dhanorkar, Shipi and Li, Yunyao and Popa, Lucian and Qian, Kun and Xu, Anbang},
journal={KDD},
year={2021}
}
Presenters: Marina Danilevsky, Shipi Dhanorkar, Yunyao Li, Lucian Popa, Kun Qian and Anbang Xu
Website: http://xainlp.github.io/
Abstract:
This lecture-style tutorial, which mixes in an interactive literature browsing component, is intended for the many researchers and practitioners working with text data and on applications of natural language processing (NLP) in data science and knowledge discovery. The focus of the tutorial is on the issues of transparency and interpretability as they relate to building models for text and their applications to knowledge discovery. As black-box models have gained popularity for a broad range of tasks in recent years, both the research and industry communities have begun developing new techniques to render them more transparent and interpretable. Reporting from an interdisciplinary team of social science, human-computer interaction (HCI), and NLP/knowledge management researchers, our tutorial has two components: an introduction to explainable AI (XAI) in the NLP domain and a review of the state-of-the-art research; and findings from a qualitative interview study of individuals working on real-world NLP projects as they are applied to various knowledge extraction and discovery tasks at a large, multinational technology and consulting corporation. The first component will introduce core concepts related to explainability in NLP. Then, we will discuss explainability for NLP tasks and report on a systematic literature review of the state-of-the-art literature in AI, NLP and HCI conferences. The second component reports on our qualitative interview study, which identifies practical challenges and concerns that arise in real-world development projects that require the modeling and understanding of text data.
Slides for talk given at Women in Engineering on March 20, 2021.
Abstract:
Natural language understanding is a fundamental task in artificial intelligence. English understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Discovery. However, scaling existing products/services to support additional languages remains an open challenge. In this talk, we will discuss the open challenges in supporting universal natural language understanding. We will share our work of the past few years in addressing these challenges. We will also showcase how universal semantic representation of natural languages can enable cross-lingual information extraction in concrete domains (e.g. compliance) and show ongoing efforts towards seamlessly scaling existing NLP capabilities across languages with minimal effort.
Explainability for Natural Language Processing - Yunyao Li
Tutorial at AACL'2020 (http://www.aacl2020.org/program/tutorials/#t4-explainability-for-natural-language-processing).
More recent version: https://www.slideshare.net/YunyaoLi/explainability-for-natural-language-processing-249912819
Title: Explainability for Natural Language Processing
@article{aacl2020xaitutorial,
title={Explainability for Natural Language Processing},
author= {Dhanorkar, Shipi and Li, Yunyao and Popa, Lucian and Qian, Kun and Wolf, Christine T and Xu, Anbang},
journal={AACL-IJCNLP 2020},
year={2020}
}
Presenter: Shipi Dhanorkar, Christine Wolf, Kun Qian, Anbang Xu, Lucian Popa and Yunyao Li
Video: https://www.youtube.com/watch?v=3tnrGe_JA0s&feature=youtu.be
Abstract:
We propose a cutting-edge tutorial that investigates the issues of transparency and interpretability as they relate to NLP. Both the research community and industry have been developing new techniques to render black-box NLP models more transparent and interpretable. Reporting from an interdisciplinary team of social science, human-computer interaction (HCI), and NLP researchers, our tutorial has two components: an introduction to explainable AI (XAI) and a review of the state-of-the-art for explainability research in NLP; and findings from a qualitative interview study of individuals working on real-world NLP projects at a large, multinational technology and consulting corporation. The first component will introduce core concepts related to explainability in NLP. Then, we will discuss explainability for NLP tasks and report on a systematic literature review of the state-of-the-art literature in AI, NLP, and HCI conferences. The second component reports on our qualitative interview study which identifies practical challenges and concerns that arise in real-world development projects which include NLP.
Towards Universal Language Understanding (2020 version) - Yunyao Li
Keynote talk given at the Pacific Asia Conference on Language, Information and Computation (PACLIC 34) on October 24, 2020.
Title: Towards Universal Natural Language Understanding
Abstract:
Understanding the semantics of natural language is a fundamental task in artificial intelligence. English semantic understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Compare and Comply. However, scaling existing products/services to support additional languages remains an open challenge. In this talk, we will discuss the open challenges in supporting universal natural language understanding. We will share our work in addressing these challenges over the past few years to provide the same unified semantic representation across languages. We will also showcase how such universal semantic understanding of natural languages can enable cross-lingual information extraction in concrete domains (e.g. insurance and compliance) and show promise towards seamlessly scaling existing NLP capabilities across languages with minimal effort.
Towards Universal Semantic Understanding of Natural Languages - Yunyao Li
Keynote talk at TextXD 2019(https://www.textxd.org)
Abstract:
Understanding the semantics of natural language is a fundamental task in artificial intelligence. English semantic understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Compare and Comply. However, scaling existing products/services to support additional languages remains an open challenge. In this demo, we will present Polyglot, a multilingual semantic parser capable of semantically parsing sentences in 9 different languages from 4 different language groups into the same unified semantic representation. We will also showcase how such universal semantic understanding of natural languages can enable cross-lingual information extraction in concrete domains (e.g. insurance and compliance) and show promise towards seamlessly scaling existing NLP capabilities across languages with minimal effort.
An In-depth Analysis of the Effect of Text Normalization in Social MediaYunyao Li
Poster corresponding to our NAACL'2015 paper "An In-depth Analysis of the Effect of Text Normalization in Social Media"
Abstract: Recent years have seen increased interest in text normalization in social media, as the informal writing styles found in Twitter and other social media data often cause problems for NLP applications. Unfortunately, most current approaches narrowly regard the normalization task as a “one size fits all” task of replacing non-standard words with their standard counterparts. In this work we build a taxonomy of normalization edits and present a study of normalization to examine its effect on three different downstream applications (dependency parsing, named entity recognition, and text-to-speech synthesis). The results suggest that how the normalization task should be viewed is highly dependent on the targeted application. The results also show that normalization must be thought of as more than word replacement in order to produce results comparable to those seen on clean text.
Paper: https://www.aclweb.org/anthology/N15-1045
Exploiting Structure in Representation of Named Entities using Active LearningYunyao Li
Slides for our COLING'18 paper: http://aclweb.org/anthology/C18-1058
Fundamental to several knowledge-centric applications is the need to identify named entities from their textual mentions. However, entities lack a unique representation and their mentions can differ greatly. These variations arise in complex ways that cannot be captured using textual similarity metrics. However, entities have underlying structures, typically shared by entities of the same entity type, that can help reason over their name variations. Discovering, learning and manipulating these structures typically requires high manual effort in the form of large amounts of labeled training data and handwritten transformation programs. In this work, we propose an active-learning based framework that drastically reduces the labeled data required to learn the structures of entities. We show that programs for mapping entity mentions to their structures can be automatically generated using human-comprehensible labels. Our experiments show that our framework consistently outperforms both handwritten programs and supervised learning models. We also demonstrate the utility of our framework in relation extraction and entity resolution tasks.
K-SRL: Instance-based Learning for Semantic Role LabelingYunyao Li
Slides for our COLING'16 paper http://aclweb.org/anthology/C/C16/C16-1058.pdf
Abstract:
Semantic role labeling (SRL) is the task of identifying and labeling predicate-argument structures in sentences with semantic frame and role labels. A known challenge in SRL is the large number of low-frequency exceptions in training data, which are highly context-specific and difficult to generalize. To overcome this challenge, we propose the use of instance-based learning that performs no explicit generalization, but rather extrapolates predictions from the most similar instances in the training data. We present a variant of k-nearest neighbors (kNN) classification with composite features to identify nearest neighbors for SRL. We show that high-quality predictions can be derived from a very small number of similar instances. In a comparative evaluation we experimentally demonstrate that our instance-based learning approach significantly outperforms current state-of-the-art systems on both in-domain and out-of-domain data, reaching F1-scores
of 89.28% and 79.91%, respectively.
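The instance-based idea described in the abstract can be illustrated with a minimal sketch: represent each predicate-argument instance as a set of features, retrieve the most similar training instances, and take a majority vote over their role labels. The feature names and the overlap-based similarity below are illustrative assumptions, not the paper's exact composite features.

```python
# Minimal kNN sketch for instance-based role classification.
# Feature dicts and the overlap similarity are hypothetical simplifications.
from collections import Counter

def similarity(a, b):
    """Count feature-value pairs shared by two instances."""
    return sum(1 for k, v in a.items() if b.get(k) == v)

def knn_label(instance, training_data, k=3):
    """Predict a role label from the k most similar training instances."""
    neighbors = sorted(training_data,
                       key=lambda ex: similarity(instance, ex[0]),
                       reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [
    ({"predicate": "give", "dep": "dobj",  "pos": "NN"}, "A1"),
    ({"predicate": "give", "dep": "iobj",  "pos": "NN"}, "A2"),
    ({"predicate": "give", "dep": "nsubj", "pos": "NN"}, "A0"),
]
print(knn_label({"predicate": "give", "dep": "dobj", "pos": "NN"}, train, k=1))  # → A1
```

With no explicit generalization step, predictions extrapolate directly from the nearest stored instances, which is what lets low-frequency, context-specific cases survive instead of being smoothed away.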
Natural Language Data Management and Interfaces: Recent Development and Open ...Yunyao Li
Slide deck for our SIGMOD 2017 tutorial.
ABSTRACT:
The volume of natural language text data has been rapidly increasing over the past two decades, due to factors such as the growth of the Web, the low cost associated with publishing, and progress on the digitization of printed texts. This growth, combined with the proliferation of natural language systems for searching and retrieving information, provides tremendous opportunities for studying some of the areas where database systems and natural language processing systems overlap. This tutorial explores the two areas of overlap most relevant to the database community: (1) managing natural language text data in a relational database, and (2) developing natural language interfaces to databases. The tutorial presents state-of-the-art methods, related systems, research opportunities, and challenges covering both areas.
Polyglot: Multilingual Semantic Role Labeling with Unified LabelsYunyao Li
Poster for our ACL paper "Polyglot: Multilingual Semantic Role Labeling with Unified Labels".
Abstract:
We present POLYGLOT, a semantic role labeling system capable of semantically parsing sentences in 9 different languages from 4 different language groups. A core differentiator is that this system predicts English Proposition Bank labels for all supported languages. This means that, for instance, a Japanese sentence will be tagged with the same labels as an English sentence with similar semantics would be. This is made possible by training the system with target-language data that was automatically labeled with English PropBank labels using an annotation projection approach. We give an overview of our system and the automatically produced training data, and discuss possible applications and limitations of this work. We present a demonstrator that accepts sentences in English, German, French, Spanish, Japanese, Chinese, Arabic, Russian and Hindi and outputs a visualization of their shallow semantics.
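The annotation projection approach mentioned above can be sketched in a few lines: given word alignments between an English sentence and its translation, copy each English PropBank label onto the aligned target-language token. The data structures below are simplified assumptions; real projection must also handle multi-token argument spans, unaligned words, and alignment noise.

```python
# Toy sketch of annotation projection across a word alignment.
def project_labels(source_labels, alignment, target_len):
    """source_labels: role label per source token (None = no role).
    alignment: dict mapping source token index -> target token index."""
    target_labels = [None] * target_len
    for src_idx, label in enumerate(source_labels):
        if label is not None and src_idx in alignment:
            target_labels[alignment[src_idx]] = label
    return target_labels

# "She gave him a book" aligned to a hypothetical 5-token target sentence
src = ["A0", None, "A2", None, "A1"]
align = {0: 1, 1: 4, 2: 2, 4: 0}  # assumed word alignments
print(project_labels(src, align, 5))  # → ['A1', 'A0', 'A2', None, None]
```

Running this over parallel corpora yields automatically labeled target-language training data, which is how the system learns to predict English PropBank labels for all nine languages.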
Tyler Baldwin, Yunyao Li, Bogdan Alexe, Ioana Roxana Stanoi: Automatic Term Ambiguity Detection. ACL (2) 2013: 804-809
Abstract:
While the resolution of term ambiguity is important for information extraction (IE) systems, the cost of resolving each instance of an entity can be prohibitively expensive on large datasets. To combat this, this work looks at ambiguity detection at the term, rather than the instance, level. By making a judgment about the general ambiguity of a term, a system is able to handle ambiguous and unambiguous cases differently, improving throughput and quality. To address the term ambiguity detection problem, we employ a model that combines data from language models, ontologies, and topic modeling. Results over a dataset of entities from four product domains show that the proposed approach achieves an F-measure of 0.96, significantly above the baseline.
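A toy sketch of how such evidence sources might be combined into a single term-level ambiguity decision. The individual signals, weights, and threshold here are hypothetical illustrations, not the paper's actual model.

```python
# Hypothetical combination of three normalized ambiguity signals:
# language-model evidence of varied usage, ontology sense count,
# and entropy of the term's topic distribution.
def ambiguity_score(lm_evidence, ontology_senses, topic_entropy,
                    weights=(0.4, 0.3, 0.3)):
    """Each signal is mapped into [0, 1] and combined as a weighted sum."""
    sense_signal = min(ontology_senses / 5.0, 1.0)  # cap many-sense terms at 1
    signals = (lm_evidence, sense_signal, topic_entropy)
    return sum(w * s for w, s in zip(weights, signals))

def is_ambiguous(*signals, threshold=0.5):
    return ambiguity_score(*signals) >= threshold

# A term seen in many unrelated contexts, with several ontology senses:
print(is_ambiguous(0.9, 4, 0.8))  # → True
# A highly specific product name:
print(is_ambiguous(0.1, 1, 0.2))  # → False
```

Because the judgment is made once per term rather than once per mention, the expensive evidence gathering is amortized over every occurrence of that term in the corpus.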
Information Extraction --- A One-Hour SummaryYunyao Li
This is the deck that I made when taking CS767 at Univ. of Michigan in 2006. While it is a few years old, it is still a useful deck for people who are new to information extraction.
Adaptive Parser-Centric Text NormalizationYunyao Li
Wonderful work done with Congle Zhang (my summer intern in 2012) and my IBM colleagues. Nominated for best paper award and presented at ACL 2013.
Adaptive Parser-Centric Text Normalization
Congle Zhang, Tyler Baldwin, Howard Ho, Benny Kimelfeld, Yunyao Li
Proceedings of ACL, pp. 1159--1168, 2013
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. The Open Data Marketplace is a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
It is the first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It also leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.