This document provides an agenda for the CITA'15 Workshop held in August 2015. The workshop schedule includes 4 sessions taking place between 8:30 am and 5:00 pm with morning and afternoon breaks. The workshop agenda covers topics such as big data analytics, open data, semantic data description using ontologies and RDF, and a case study on converting a dataset to linked open data. The format of the workshop will be interactive with exercises and discussion encouraged.
Larry will discuss what data science means in general, and more specifically at Udemy. He will describe some key data science frameworks and what it means for them to be agile. He will also discuss what it would ideally mean to be a data scientist at Udemy.
The document summarizes the typical evolution of data processing at a startup company and provides details about data engineering at Udemy. It describes how companies initially struggle with data before establishing scalable data infrastructure and workflows. At Udemy, they use AWS Redshift as their data warehouse, ingest data from various sources using Python ETL pipelines scheduled through Pinball, and use Hadoop/EMR for batch processing and AWS Kinesis for real-time processing. Lessons learned include starting with batch processing, considering the type of data, and storing data in a log format for debugging.
The document discusses how GitLab.com builds its data services and products. It describes how GitLab.com uses its own DevOps platform to build an Enterprise Data Platform that analyzes data from GitLab.com. The data team faces challenges around scaling, visibility, and speed. To address these, the team takes actions like open sourcing tools, adopting DevOps practices, and establishing roles, processes, and technologies to build a trusted data model and framework. The key takeaways emphasize continuous iteration, discipline, automation, and living the company values.
ModCloth uses Tableau to enable stakeholders across the company to access and analyze data independently. By training stakeholders in Tableau, the data team is able to focus on more complex analyses while stakeholders can answer questions with same-day data. Some challenges in training include different skill levels and goals amongst stakeholders. ModCloth addresses this through tailored trainings and office hours. Since implementing stakeholder training, the data team spends less time on routine tasks and more on modeling and products while stakeholders complete over 200 additional requests per quarter in Tableau.
OBIEE, Endeca, Hadoop and ORE Development (on Exalytics) (ODTUG 2013) - Mark Rittman
A presentation from ODTUG 2013 on tools other than OBIEE for Exalytics, focusing on analysis of non-traditional data via Endeca, "big data" via Hadoop and statistical analysis / predictive modeling through Oracle R Enterprise, and the benefits of running these tools on Oracle Exalytics
Washington DC DataOps Meetup -- Nov 2019 - DataKitchen
This document discusses challenges with current data analytics practices and how adopting a DataOps approach can help address them. It notes that current practices often involve many people using complex, fragmented toolchains which results in high error rates, slow deployment speeds, and an inability to deliver insights at the speed of business. DataOps is presented as a way to transform data analytics by applying practices from DevOps and Lean manufacturing like continuous integration, monitoring, version control systems, and reusable components. The document provides a seven step framework for implementing DataOps along with additional considerations for architecture, metrics, and collaboration.
Joe Caserta was a featured speaker, along with MIT Sloan School faculty and other industry thought-leaders. His session 'You're the New CDO, Now What?' discussed how new CDOs can accomplish their strategic objectives and overcome tactical challenges in this emerging executive leadership role.
In its tenth year, the MIT CDOIQ Symposium 2016 continues to explore the developing role of the Chief Data Officer.
For more information, visit http://casertaconcepts.com/
This document provides recommendations for launching a successful advanced analytics program in a mature industry. It recommends:
1. Starting at the top by gaining executive support and framing analytics as a strategic capability.
2. Finding the right people by hiring data scientists and multi-disciplinary talent, as well as growing internal skills, but prioritizing industry knowledge over tool expertise.
3. Getting organized by establishing processes, connecting different parts of the organization, and promoting internal evangelism for analytics.
How the world of data analytics, science and insights is failing and how the principles from Agile, DevOps, and Lean are the way forward. #DataOps Given at DevOps Enterprise Summit 2019
Many companies start their big data and AI journey by hiring a team of data scientists, giving them some data, and expecting them to work miracles. Although this may yield results, it is not an efficient way to use data scientists. We will explain the problems that occur and how to adapt the context to get business value from data scientists.
- Why data science teams might fail to deliver results
- What data scientists need to be efficient
- What talent you need in addition to data scientists
Do Agile Data in Just 5 Shocking Steps! - DataKitchen
For over 10 years, we have been doing agile for software development yet people struggle to do agile for data, BI, and analytics. After a quick review of the agile manifesto and principles, this talk looks at which agile practices have worked for data and which are still hard. Then, with analyst requirements in mind, this talk reveals the 5 shocking steps to actually do agile with data.
Knowledge extraction and incorporation are currently considered beneficial for efficient Big Data analytics. Knowledge can take part in workflow design, constraint definition, parameter selection and configuration, and human-interactive decision-making strategies. Here we present BIGOWL, an ontology to support knowledge management in Big Data analytics. BIGOWL is designed to cover a wide vocabulary of terms concerning Big Data analytics workflows, including their components and how they are connected, from data sources to analytics visualization. It also takes into consideration aspects such as parameters, restrictions and formats. This ontology defines not only the taxonomic relationships between the different concepts, but also instances representing specific individuals to guide users in the design of Big Data analytics workflows. For testing purposes, two case studies are developed: first, real-world stream processing with Spark of traffic open data for route optimization in the urban environment of New York City; and second, data mining classification of an academic dataset on local/cloud platforms. The analytics workflows resulting from the BIGOWL semantic model are validated and successfully evaluated.
Moving Past Infrastructure Limitations - Presented by MediaMath
This presentation was given at a Big Data Warehousing Meetup with Caserta Concepts, MediaMath and Qubole. You can learn more about the event here: http://www.meetup.com/Big-Data-Warehousing/events/228372516/
Event description:
At Caserta Concepts, we are firm believers in big data thriving on the cloud. The instant-on, nearly unlimited storage and computing capabilities of AWS have made it the de facto solution for a full spectrum of organizations needing to process large amounts of data.
What's more, an ecosystem of value-added platforms has emerged to further ease and democratize the implementation of cloud based solutions. Qubole has developed a great platform for easily deploying and managing ephemeral and long-lived Hadoop and Spark clusters on AWS.
Moving Past Infrastructure Limitations: Data Warehousing at MediaMath
Over the past year and a half, MediaMath has undertaken a “data liberation” effort in an attempt to leave its big-box, monolithic data warehouse behind. In this talk, Rory Sawyer, Software Engineer at MediaMath, will describe how this effort transformed MediaMath’s legacy architecture and legacy mindset, which imposed harsh inefficiencies on data sharing and utilization. The current mindset removes these inefficiencies and allows them to say “yes” to more projects and ideas.
Rory will also demo how MediaMath uses Amazon Web Services and Qubole so that infrastructure is no longer a limiting factor on what and how users query. This combination allows them to scale their resources up and down as needed while bridging different data sources and execution engines. Using and extending MediaMath’s data warehousing is no longer a privileged activity but an ability that every employee and client has.
Data kitchen 7 agile steps - big data fest 9-18-2015 - DataKitchen
This document discusses applying agile principles and practices to data and analytics teams to address the complexity they face. It outlines seven steps to doing agile data work: 1) adding tests, 2) modularizing and containerizing work, 3) using branching and merging, 4) employing multiple environments, 5) giving analysts tools to experiment, 6) using simple storage, and 7) supporting small team, feature branch, and data governance workflows. The goal is to enable rapid experimentation and integration of new data sources through these agile practices adapted for analytics teams and their unique needs.
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a... - Rehgan Avon
2018 Women in Analytics Conference
https://www.womeninanalytics.org/
Over the last year I’ve become obsessed with learning how to be a better "cloud computing evangelist to data scientists" - specifically to the R community. I’ve learned that this isn’t often an easy undertaking. Most people (data scientists or not) are skeptical of changing up the tools and workflows they’ve come to rely on when those systems seem to be working. Resistance to change increases even further with barriers to quick adoption, such as having to teach yourself a completely new technology or framework. I’d like to give a talk about how working in the cloud changes data science and how exploring these tools can lead to a world of new possibilities within the intersection of DevOps and Data Analytics.
Topics to discuss:
- Working through functionality/engineering challenges with R in a cloud environment
- Opportunities to customize and craft your ideal version of R/RStudio
- Making and embracing a decision on what is “real” about your analysis or daily work (Chapter 6 in R for Data Science)
- Running multiple R instances in the cloud (why would you want to do this?)
- Becoming an R/Data Science Collaboration wizard: Building APIs with Plumber in the Cloud
In this Strata+Hadoop World 2015 presentation, Ron Bodkin, President of Think Big, a Teradata company, explains changes for data modeling on big data systems and five important new analytic patterns becoming more commonplace as companies grow their data driven capabilities.
1. Spil Games uses a bottom-up monthly forecasting process where ARIMA models in R generate initial traffic forecasts for 500 markets/channels, which are then loaded into Tableau for exploratory analysis and adjustment.
2. Key business users explore and modify the forecasts in Tableau before the adjusted forecast is loaded back into the data warehouse.
3. Forecasting considers factors like seasonality, known events, and regressors to predict metrics like traffic, gameplays, pageviews, and advertising across markets/channels on a monthly basis.
New times, new hype. Buzzwords like big data and Hadoop have given way to AI and machine learning. But it is neither technology, old or new, nor machine learning that separates the companies that get value from data from the companies that struggle.
When big data was at its peak, several young, technology-intensive companies succeeded in absorbing big data. They acquired large Hadoop clusters, learned to master data and created valuable products with machine learning. However, big data has had a limited impact at traditional companies, and the list of long and expensive data lake and Hadoop projects is long.
The key to implementing successful projects that transform data into business value is to democratise data - making it accessible and easy to use within an organisation.
seven steps to dataops @ dataops.rocks conference Oct 2019 - DataKitchen
The document outlines seven steps for implementing DataOps to improve data analytics projects: 1) orchestrate the data journey from access to production, 2) add automated tests and monitoring, 3) use version control for code, 4) enable branching and merging of code, 5) use multiple environments, 6) reuse and containerize components, and 7) parameterize processing. It also discusses three additional steps: data architecture, inter- and intra-team collaboration, and process analytics for measurement. The goal of DataOps is to increase project success rates by integrating testing, monitoring, collaboration and automation practices across the entire data and analytics workflow.
H2O for Medicine and Intro to H2O in Python - Sri Ambati
Erin LeDell presents on machine learning for medicine using the H2O platform. She discusses how electronic health records, genomic data, medical images, and data from wearables can be used with machine learning for applications like predictive diagnostics, prognosis, and remote patient monitoring. H2O is an open-source machine learning platform that provides algorithms like deep learning, random forests, and gradient boosting in an easy-to-use interface. She demonstrates an EEG example that predicts eye state from brain signals.
Watch the companion webinar for this presentation at http://embt.co/KLopez826. In this webinar, Karen Lopez of InfoAdvisors will cover 10 tips for the modern data architect and resources for coming up to speed on these new approaches. She will share how modern data modeling approaches address both SQL (relational) and NoSQL technologies. We'll look at the role of a data modeler, and how models, processes and data governance processes can add value to enterprise big data and NoSQL development projects.
Understanding DataOps and Its Impact on Application Quality - DevOps.com
Modern-day applications are data-driven and data-rich. The infrastructure your backends run on is a critical aspect of your environment and requires unique monitoring tools and techniques. In this webinar, learn what DataOps is and how critical good DataOps is to the integrity of your application. Intelligent APM for your data is critical to the success of modern applications. In this webinar you will learn:
The power of APM tailored for Data Operations
The importance of visibility into your data infrastructure
How AIOps makes data ops actionable
Smart companies know that business intelligence surfaces insights. With complex analytics, data mining and everything in between, it takes many moving parts to serve up the big picture. The key is to provide full-stack visibility into the entire BI environment, ensuring solid service and system performance.
Learn more at http://www.insideanalysis.com
Many think that data science is like a Kaggle competition. There are, however, big differences in approach. This presentation is about designing your evaluation scheme carefully to avoid overfitting and unexpected production performance.
This document provides an overview of big data concepts and technologies for managers. It discusses problems with relational databases for large, unstructured data and introduces NoSQL databases and Hadoop as solutions. It also summarizes common big data applications, frameworks like MapReduce, Spark, and Flink, and different NoSQL database categories including key-value, column-family, document, and graph stores.
Text Analytics & Linked Data Management As-a-Service - Marin Dimitrov
slides from the talk on "Text Analytics & Linked Data Management As-a-Service with S4" from the ESWC'2015 workshop on Semantic Web Enterprise Adoption & Best Practices
full paper available at http://2015.wasabi-ws.org/papers/wasabi15_1.pdf
ALLDATA 2015 - RDF Based Linked Data Management as a DaaS Platform - Seonho Kim
Suggests a way to manage a linked data platform for use by domain-specific applications.
Best paper award - http://www.iaria.org/conferences2015/AwardsALLDATA15.html
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data... - Denodo
Watch the full session: Denodo DataFest 2016 sessions: https://goo.gl/Bvmvc9
Data prep and data blending are terms that have come to prominence over the last year or two. On the surface, they appear to offer functionality similar to data virtualization…but there are important differences!
In this session, you will learn:
• How data virtualization complements or contrasts technologies such as data prep and data blending
• Pros and cons of functionality provided by data prep, data catalog and data blending tools
• When and how to use these different technologies to be most effective
This session is part of the Denodo DataFest 2016 event. You can also watch more Denodo DataFest sessions on demand here: https://goo.gl/VXb6M6
This document provides an overview of a course on building a real data product. The goals are to apply skills from previous courses, use a real-world dataset to build a predictive model, and present the completed project. Students will build a book recommendation system using Goodreads data. The workflow involves ingesting data, wrangling it, building a recommender model using matrix factorization, and creating reports with the results.
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
Watch here: https://bit.ly/3719Bi7
Advanced data science techniques, like machine learning, have proven an extremely useful tool for deriving valuable insights from existing data. Platforms like Spark and complex libraries for R, Python and Scala put advanced techniques at the fingertips of the data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Attend this webinar and learn:
- How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- How popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc. integrate with Denodo
- How you can use the Denodo Platform with large data volumes in an efficient way
- About the success McCormick has had as a result of seasoning the Machine Learning and Blockchain landscape with data virtualization
Visualization of data is, no doubt, a key component of our industry. The path data travels from when it is created until it takes shape in a chart is sometimes obscure and overlooked, as it tends to live on the engineering side (when volume is relevant), an area that data scientists tend to visit but not the usual web/marketing data analyst. Nowadays the options to tame all that journey and make the best of it are many, and they don't require extensive engineering knowledge. Small or Big Data, let's see what "Store, Extract, Transform, Load, Visualize" is all about.
Advanced Project Data Analytics for Improved Project Delivery - Mark Constable
Data Analytics is already beginning to impact how projects are delivered. We can now automate minute taking and capturing actions, we can use Flow to progress chase, Power BI reduces the burden of reporting.
But we are just scratching the surface. It won’t be long before we can leverage the rich dataset of experience to predict what risks are likely to occur, understand which WBS elements will be susceptible to variance, deduce what the optimum resource profile looks like, and define a schedule by leveraging data from those projects that have gone before.
The role of a project professional is about to change dramatically. In this webinar we will explore the challenges and opportunities, and how we should respond. It’s a call-to-action for the community to mobilise, help to reshape project delivery and understand the implications for you and your organisation.
Presenter Martin Paver is a Chartered Project Professional, APM Fellow and Chartered Engineer. In December 2017 he established the London Project Data Analytics meetup, which has quickly spread across the UK and expanded to 3000+ members. Martin has major project experience, including leading a billion-dollar project with a team of 220 and a multi-billion PMO with a team of 50. He has a detailed grasp of project management and combines this with a broad understanding of recent developments in the field of data science. He is on a mission to ensure that the project management profession readies itself for a transformed future.
Learning outcomes:
- Understand the implications of advanced data analytics on project delivery
- Understand the scope of which functions it is likely to impact
- Help you to develop a strategy for how you engage with it
- Understand how to leverage the benefits and opportunities that will emerge from it
Presenter:
Martin Paver, CEO & Founder, Projecting Success Ltd
This presentation discusses data visualization tools and Kwantu's approach to data visualization for monitoring and evaluation. It covers why visualizing data is useful, where to start, different visualization tools for different experience levels and budgets, and Kwantu's approach of collecting validated data at the lowest level to enable real-time reporting and visualization without additional data processing. Kwantu's technology choices include tools for building data collection forms, taxonomies, dynamic data queries, workflows, report building, and planned integration with the Kibana visualization framework to create interactive visualizations updated in real-time.
Using text analytics to manage mobile qual data - Civicom - Merlien Institute
Presented by Mike Timmerman, Business Analyst, Civicom
at Market Research in the Mobile World Europe
8 - 11 October 2013, London, Europe
This event is proudly organised by Merlien Institute
Check out our upcoming events by visiting http://www.mrmw.net
This document provides an overview of Visual Analytics Session 3. It discusses data joining and blending in Tableau. Specifically, it explains why joining or blending data is necessary when data comes from multiple sources. It then describes the different types of data joins in Tableau - inner joins, left joins, right joins, and outer joins. An example is provided to demonstrate an inner join using a primary key to connect related data between two tables. The goal is to understand how to connect different but related data sources in Tableau using common keys or variables.
This document discusses how a company called Peloton Therapeutics uses Dotmatics software to extend its drug discovery capabilities. It provides examples of using Dotmatics Exe Runner and studies to perform docking simulations and generate patent tables. It also describes how data can be loaded into databases and pivoted for analysis. Key aspects covered include configuring Oracle triggers to initiate jobs, setting up studies forms, and using the Dotmatics Pivot Engine for unpivoted data.
This document provides an overview of a lecture on big data analytics given by Dr. Ching-Yung Lin. The key points covered in the lecture include:
- Definitions and characteristics of big data based on the 3V's of volume, velocity and variety.
- Techniques used for big data such as massive parallelism, distributed storage and processing, machine learning and data visualization.
- Factors that have enabled big data to become prominent in recent years like greater data collection, open source software and commodity hardware.
- Examples of big data platforms, databases and analytics techniques including Hadoop, Spark, NoSQL databases and graph databases.
- The large and growing market for big data
This document summarizes an event hosted by the Cincinnati Tableau User Group. The agenda includes a Tableau training session led by local users covering multiple Tableau topics and techniques. This will be followed by a Q&A panel where multiple Tableau users will answer questions about Tableau Desktop and Server. The document provides details on the event hosts, trainers, topics to be covered in the training, and information on how to join the Cincinnati Tableau User Group.
Tableau Drive, a new methodology for scaling your analytic culture - Tableau Software
Tableau Drive is a methodology for scaling out self-service analytics. Drive is based on best practices from successful enterprise deployments. The methodology relies on iterative, agile methods that are faster and more effective than traditional long-cycle deployment. A cornerstone of the approach is a new model of a partnership between business and IT.
The Drive Methodology is available for free. Some organizations will choose to execute Drive themselves; others will look to Tableau Services or Tableau Partners for expert help.
Data is growing exponentially. What should business managers do to make better business decisions? I explain three key things step by step. Just start today!
Lean Analytics is a set of rules to make data science more streamlined and productive. It touches on many aspects of what a data scientist should be and how a data science project should be defined to be successful. During this presentation Richard will show where data science projects go wrong, how you should think of data science projects, what constitutes success in data science, and how you can measure progress. This session will be loaded with terms, stories and descriptions of project successes and failures. If you're wondering whether you're getting value out of data science, how to get more value out of it, and even whether you need it, then this talk is for you!
What you will take away from this session
Learn how to make your data science projects successful
Evaluate how to track progress and report on the efficacy of data science solutions
Understand the role of engineering and data scientists
Understand your options for processes and software
Presentation: Study: #Big Data in #Austria, Mario Meir-Huber, Big Data Leader Eastern Europe, Teradata GmbH & Martin Köhler, Austrian Institute of Technology, AIT (AT), at the European Data Economy Workshop, held back-to-back with SEMANTiCS2015 on 15 September 2015 in Vienna.
This document summarizes the progress made by the National Information Standards Organization (NISO) in developing standards for new metrics in scholarship, known as altmetrics. It discusses how NISO held discussions and meetings with over 400 contributors to brainstorm ideas and reach consensus on key elements needed to build trust in metrics, including defining what is counted, how it is identified, aggregation procedures, and data exchange standards. The goal is to establish standardized approaches and definitions that can facilitate consistent measurement and comparison of the broader impacts of scholarly work.
1. CITA’15 Workshop, August 2015
Semantic Enrichment of Unstructured Datasets
Bebo White
SLAC National Accelerator Laboratory / Stanford University
bebo@slac.stanford.edu
3. CITA’15 Workshop, August 2015
Workshop Agenda (1/3)
• Overview of “Big Data Analytics”
• Goals
• Common challenges
• Examples and applications
• What is missing
• Big Data and Open Data
• Characteristics of open (and semantic) data
• Usage
• Challenges
• Processes
4. CITA’15 Workshop, August 2015
Workshop Agenda (2/3)
• Semantically describing data
• Ontologies and namespaces
• Data triples
• Triplification
• Introduction to RDF(S)
• Case Study - FOAF
• Merging RDF data
• RDF tools
5. CITA’15 Workshop, August 2015
Workshop Agenda (3/3)
• PingER as a triplification case study
• Introduction to project
• PingER LOD
• Data model and process
• PingER LOD “data bloating”
• How PingER LOD extends PingER
• Summary and lessons learned
6. CITA’15 Workshop, August 2015
Workshop Format
• A workshop, not a tutorial
• Goal is to introduce concepts and terminology and provoke future research
• Must be very interactive - questions/discussion at any time
• Individual and group exercises
• Length of workshop depends on involvement
7. CITA’15 Workshop, August 2015
“High-volume, -velocity, and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making”
(Gartner’s definition of Big Data)
8. CITA’15 Workshop, August 2015
• Volume?
• ~ data volume worldwide in 2013 = 3.5 ZB (including 400 billion feature-length HD movies)
• Velocity?
• Every 60 sec. on Facebook - 510K posted comments; 293K status updates; 136K uploaded photos
• 30 billion shares
• 20 million apps installed
9. CITA’15 Workshop, August 2015
• Variety?
• Any type of data, both meaningful and meaningless
• Veracity?
• How is trust established?
• What does “like” really mean?
10. CITA’15 Workshop, August 2015
Evaluating “the V’s”
• A recent survey conducted by Paradigm4 indicates:
• variety, not volume, is the bigger challenge of analyzing Big Data - 71% of respondents
• Data Scientists aren’t terribly concerned with the “size” of the data currently being analyzed - tools and systems are in place to work with large datasets
• storing large amounts of structured (or semi-structured) data is not the problem; analysis is
11. CITA’15 Workshop, August 2015
Common Challenges of Harnessing Big Data
• Mining huge (?) datasets
• Shortages of Big Data experts
• Privacy, legal, and social issues
• Strategies for acquiring Big Data - a new form of currency
• BUT
12. CITA’15 Workshop, August 2015
“The theory is that you pump Big Data into the ‘black box’ of an analytics engine - most likely hidden on some unknown server in the cloud - and you get back a continuous stream of insights”
13. CITA’15 Workshop, August 2015
“When you have large amounts of data your appetite for hypotheses tends to get larger. And if it’s growing faster than the statistical strength of the data, then many of your inferences are likely to be false. They are likely to be ‘white noise.’ We have to have error bars around our predictions.”
- Michael Jordan
14. CITA’15 Workshop, August 2015
Why is “Bigger Data” Better?
• Outliers or small clusters
• Rare discrete values or classes
• Missing values
• Rare events or objects
17. CITA’15 Workshop, August 2015
Unstructured Data
• Does not have a pre-defined data model or is not organized in a pre-defined manner
• Typically text-heavy, but may contain data such as dates, numbers, and facts
• May result in irregularities and ambiguities that make it difficult to understand using traditional programs
(Ref: Wikipedia)
18. CITA’15 Workshop, August 2015
Typical Big Data Problem
• Iterate over a large number of records
• Extract something of interest from each (MAP)
• Shuffle and sort intermediate results
• Aggregate intermediate results (REDUCE)
• Generate final output
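To make the MAP, shuffle/sort, and REDUCE steps concrete, here is a minimal single-machine sketch in Python (not Hadoop; the input records are hypothetical):

```python
from collections import defaultdict

records = ["ping ok", "ping timeout", "ping ok"]  # hypothetical input records

# MAP: extract something of interest from each record
mapped = [(word, 1) for line in records for word in line.split()]

# SHUFFLE/SORT: group the intermediate (key, value) pairs by key
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# REDUCE: aggregate each group, then generate the final output
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'ok': 2, 'ping': 3, 'timeout': 1}
```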
21. CITA’15 Workshop, August 2015
MapReduce Implementations
• Google has a proprietary implementation in C++
• Bindings in Java, Python
• Hadoop is an open-source implementation in Java
• Development led by Yahoo!, now an Apache project
• Used in production at Yahoo!, Facebook, Twitter, LinkedIn, Netflix, etc.
• The de facto Big Data processing platform
• Lots of custom research implementations
22. CITA’15 Workshop, August 2015
An Interesting Example - “Sentiment Analysis”
• Goal - gauging mood on social network data
• Not a traditional survey or focus group
• Social sites operate 24/7
• Timeliness - not subject to time lags
• Useful to marketers, IT, customers, etc. - a limited (not general) sector
23. CITA’15 Workshop, August 2015
Difficult Comment Analysis (1/2)
• False negatives - “crying” & “crap” (negative) vs. “crying with joy” & “holy crap!” (positive)
• Relative sentiment - “I bought a Honda Accord” - great for Honda, bad for Toyota
• Compound sentiment - “I love the phone but hate the network”
• Conditional sentiment - “If someone doesn’t call me back, I’m never doing business with them again!”
24. CITA’15 Workshop, August 2015
Difficult Comment Analysis (2/2)
• Scoring sentiment - “I like it” vs. “I really like it” vs. “I love it”
• Sentiment modifiers - “I bought an iPhone today :-)” “Gotta love the telephone company ;-<”
• International/cultural sentiments
• Japanese - unique emoticons for crying - (;_;)
• Italians - effusive, grandiose
• British - drier, less effusive
28. CITA’15 Workshop, August 2015
There is a 5th “V”: VALUE
Despite its volume, veracity, etc., what does it really give us?
How can we extract insight/knowledge?
33. CITA’15 Workshop, August 2015
Looking back…
• One of the great (IMHO) insights in Web 2.0 was developing mashups
• Supported the process of converting data to knowledge/insight
• Usually done in an ad hoc manner, e.g., “screen scraping”
• Sometimes done with APIs
34. CITA’15 Workshop, August 2015
Is it possible to do “Data Programming”?
• Can processes extract from data pools the same insights that humans do?
• How do humans process collections of data?
36. CITA’15 Workshop, August 2015
• Big Data (and even “not so Big Data”) tends to be unstructured data (e.g., lists, e-mails, tweets, etc.)
• Therefore it tends to be “thin” rather than “thick”
• “Thin” means very little (if any) context - just data, little knowledge
• What can be added to change data from “thin” to “thick”?
37. CITA’15 Workshop, August 2015
9 Steps to Extract Insight from Unstructured Data (1/2)
1. Make sense of the disparate data sources*
2. Sign off on the method of analytics and find a clear way to present the results
3. Decide the technology stack for data ingestion and storage
4. Keep information in a data lake until it has to be stored in a data warehouse
5. Prepare the data for storage
38. CITA’15 Workshop, August 2015
9 Steps to Extract Insight from Unstructured Data (2/2)
6. Retrieve useful information
7. Ontology evaluation*
8. Statistical modeling and execution
9. Obtain insight from the analysis and visualize it*
40. CITA’15 Workshop, August 2015
Linked Data
• Provides access to the semantics of data items
• Based upon Semantic Web technologies and ontologies
• Designed for machines first and humans later
• Degree of structure in descriptions of things is high
42. CITA’15 Workshop, August 2015
Linked Data Pros
• Far more “parseable” and “machine processable” than raw unstructured data
• Enhances data descriptions for complex analyses
• Can contribute to the VERACITY of our data
• Wide variety of discipline/data ontologies available
43. CITA’15 Workshop, August 2015
Linked Data Cons
• Much harder to do than adding keyword metadata
• Building efficient processing applications and parsers
• Implementing effective linked data stores
44. CITA’15 Workshop, August 2015
Linked Open Data
• LOD refers to data stores of Linked Data that are published (made available online and accessed via URLs) and free to use
• Open data means it must be available to all without copyright or ownership
• There is an increasing trend towards “opening” government data (US and UK, San Francisco and more) and scientific results
• Provides unprecedented ability to build “mashup” applications
46. CITA’15 Workshop, August 2015
How do we do this?
• By defining unambiguously the relationships between data items
• By using a shared definition and meaning mechanism
• By expressing the semantics and syntax inherent in the data
55. CITA’15 Workshop, August 2015
Fundamental Concepts (1/2)
• Modeling - making sense of unorganized information/data
• Formality/Informality - the degree to which the meaning of a modeling language is given independent of the particular speaker or audience
56. CITA’15 Workshop, August 2015
Fundamental Concepts (2/2)
• Commonality and Variability - how to manage things in common and some with important differences
• Expressivity - the ability of a modeling language to express maximum variety in the model
57. CITA’15 Workshop, August 2015
Tabular Data About Elizabethan Literature and Music

ID  Title           Author       Medium  Year
1   As You Like It  Shakespeare  Play    1599
2   Hamlet          Shakespeare  Play    1604
3   Othello         Shakespeare  Play    1603
4   “Sonnet 78”     Shakespeare  Poem    1609
60. CITA’15 Workshop, August 2015
Ontology/Vocabulary (1/2)
• Provides a common background and understanding of a particular domain or field of study, and ensures a common ground among those who study the information
• A way of organizing concepts, information, and ideas that is meant to be universal within the field and allows for a common language to be spoken
61. CITA’15 Workshop, August 2015
Ontology/Vocabulary (2/2)
• A structural framework that allows concepts to be laid out in a way that makes sense
• Shows the connections and relationships between concepts in a manner that is generally accepted by the field
64. CITA’15 Workshop, August 2015
Sample Triples
Shakespeare wrote King Lear
Shakespeare wrote Macbeth
Anne Hathaway married Shakespeare
Shakespeare livedIn Stratford
Stratford isIn England
Macbeth setIn Scotland
England partOf UK
Scotland partOf UK
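These triples can be loaded into an in-memory RDF graph with Python’s rdflib library; a minimal sketch, assuming a made-up lit: namespace URI (not part of the workshop materials):

```python
from rdflib import Graph, Namespace

LIT = Namespace("http://example.org/elizabethan/")  # hypothetical namespace

g = Graph()
g.add((LIT.Shakespeare, LIT.wrote, LIT.KingLear))
g.add((LIT.Shakespeare, LIT.wrote, LIT.Macbeth))
g.add((LIT.AnneHathaway, LIT.married, LIT.Shakespeare))
g.add((LIT.Shakespeare, LIT.livedIn, LIT.Stratford))
g.add((LIT.Stratford, LIT.isIn, LIT.England))

g.bind("lit", LIT)                   # use the lit: prefix when serializing
print(g.serialize(format="turtle"))  # human-readable Turtle output
```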
66. CITA’15 Workshop, August 2015
Linked Data Technology Stack
• URIs - Uniform Resource Identifiers (a generalization of URLs)
• HTTP - HyperText Transfer Protocol
• RDF - Resource Description Framework
• RDFS/OWL - RDF Schema/Web Ontology Language
67. CITA’15 Workshop, August 2015
Linked Data Principles (1/2)
• Use URIs as names of things
• Anything, not just documents
• Information resources and non-information resources
• Use HTTP URIs
• Globally unique names, distributed ownership
• Allows people to look up those names
68. CITA’15 Workshop, August 2015
Linked Data Principles (2/2)
• Provide useful information in RDF
• When someone (or something) looks up a URI
• Include RDF links to other URIs
• To enable discovery of related information
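In practice, “looking up” an HTTP URI can be as simple as asking an RDF library to fetch and parse it. A sketch with rdflib, assuming the DBpedia URI below still dereferences to RDF via content negotiation:

```python
from rdflib import Graph

g = Graph()
# Dereference a linked data URI; the server responds with RDF about the resource
g.parse("http://dbpedia.org/resource/William_Shakespeare")

# Print every triple, including RDF links out to other URIs
for subj, pred, obj in g:
    print(subj, pred, obj)
```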
69. CITA’15 Workshop, August 2015
Plays of Shakespeare with Qnames

Subject          Predicate  Object
lit:Shakespeare  lit:wrote  lit:Hamlet
lit:Shakespeare  lit:wrote  lit:Othello
lit:Shakespeare  lit:wrote  lit:WintersTale
…                …          …
71. CITA’15 Workshop, August 2015
Triples Referring to URIs with a Variety of Namespaces

Subject           Predicate    Object
lit:Shakespeare   lit:wrote    lit:Hamlet
bio:AnneHathaway  bio:married  bio:Shakespeare
geo:Stratford     geo:isIn     geo:England
geo:England       geo:partOf   geo:UK
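Each qname is just shorthand for a full URI in some namespace; a small rdflib sketch with invented placeholder namespace URIs:

```python
from rdflib import Namespace

# Placeholder URIs standing in for published vocabularies
LIT = Namespace("http://example.org/literature/")
GEO = Namespace("http://example.org/geography/")

# A qname such as lit:wrote expands to a full URI
print(LIT.wrote)      # http://example.org/literature/wrote
print(GEO.Stratford)  # http://example.org/geography/Stratford
```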
72. CITA’15 Workshop, August 2015
Reengineering process - data to data triples (“triplification”)
[Diagram: define/acquire the data source and its meta-model; define/acquire a mapping description; apply reengineering. Data Source + Mapping yields an RDF Dataset.]
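A minimal sketch of that mapping-driven reengineering step in Python with rdflib; the inline CSV source, the column-to-predicate mapping, and the namespace are all hypothetical:

```python
import csv
from io import StringIO
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/data/")  # hypothetical namespace

# Data source (an inline CSV here) plus a mapping description
source = StringIO("id,title,year\n1,Hamlet,1604\n2,Othello,1603\n")
mapping = {"title": EX.title, "year": EX.year}  # column -> predicate

# Apply reengineering: each row becomes a subject, each mapped column a triple
g = Graph()
for row in csv.DictReader(source):
    subject = EX["work/" + row["id"]]
    for column, predicate in mapping.items():
        g.add((subject, predicate, Literal(row[column])))

print(g.serialize(format="turtle"))  # the resulting RDF dataset
```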
73. CITA’15 Workshop, August 2015
RDF and the Semantic Web
• Supports the goal of the Semantic Web
• Web information/data should have exact and unambiguous meaning
• Web information/data can be understood and processed by computers
• Computers can integrate information/data from multiple sources on the Web
74. CITA’15 Workshop, August 2015
What is RDF?
• Resource Description Framework
• Provides a model for data, and a syntax, so that independent parties can exchange and use it
• Designed mainly to be read and understood by computer processors, not humans
• Written in XML
• A W3C Recommendation
• Any XML processor or parser can use it
76. CITA’15 Workshop, August 2015
Basic Ideas Behind RDF
• RDF uses Web identifiers (URIs) to identify resources
• RDF describes resources with properties and property values
• Everything is represented as triples
• The essence of RDF is the (s,p,o) triple
77. CITA’15 Workshop, August 2015
RDF Data Model
• Any expression in RDF is a collection of triples (subject, predicate, object)
• A set of triples is called an RDF graph
• The nodes of an RDF graph are its subjects and objects
• Direction is important - an edge always points to the object
• An assertion of an RDF triple says the relationship (as indicated by the predicate) holds between subject and object
• The meaning of an RDF graph is the conjunction (AND) of the statements corresponding to all the triples it contains
• RDF does not provide means to express negation (NOT) or disjunction (OR)
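Because a graph’s meaning is the conjunction of its triples, merging two RDF graphs is simply taking the union of their triple sets; a sketch with rdflib (hypothetical namespaces as before):

```python
from rdflib import Graph, Namespace

LIT = Namespace("http://example.org/literature/")
GEO = Namespace("http://example.org/geography/")

g1, g2 = Graph(), Graph()
g1.add((LIT.Shakespeare, LIT.livedIn, GEO.Stratford))
g2.add((GEO.Stratford, GEO.isIn, GEO.England))

# Merging = set union of triples; the result asserts both statements (AND)
merged = g1 + g2
print(len(merged))  # 2
```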
78. CITA’15 Workshop, August 2015
RDF Design Goals
• Having a simple data model
• Having formal semantics and provable inference
• Using an extensible URI-based vocabulary
• Using an XML-based syntax
• Supporting use of XML Schema datatypes
• Allowing anyone to make statements about any resource
80. CITA’15 Workshop, August 2015
Case Study - FOAF
• Friend-of-a-Friend
• A linked data description of a person
• More than just a blog or personal Web page
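A small sketch of a FOAF description using rdflib, which ships the real FOAF vocabulary; the person URI, homepage, and friend below are made up:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, RDF

g = Graph()
me = URIRef("http://example.org/people/bebo#me")  # hypothetical person URI

g.add((me, RDF.type, FOAF.Person))
g.add((me, FOAF.name, Literal("Bebo White")))
g.add((me, FOAF.homepage, URIRef("http://example.org/~bebo/")))
# foaf:knows links are what make FOAF “friend of a friend”
g.add((me, FOAF.knows, URIRef("http://example.org/people/alice#me")))

print(g.serialize(format="turtle"))
```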
90. CITA’15 Workshop, August 2015
Semantic Mashups
• A mashup application using Semantic Web technologies inside
• Supplements Web 2.0 mashups by adding access to semantic data sources
• Can be either client-side or server-side
93. CITA’15 Workshop, August 2015
Case Study: PingER
• PingER (Ping End-to-end Reporting)
• Uses the Internet ping facility to monitor the performance of Internet links worldwide
• Measures:
• Short- and long-term RTT
• Packet loss percentages
• Jitter
• Lack of reachability (no response to ping)
• Throughput and quality of IP telephony (VoIP)
98. CITA’15 Workshop, August 2015
PingER Monitor Node Format (original)

Monitor Host Name        Monitor Address  Remote Name  Remote Address  Bytes  Time       Xmt  Rcv  Min  Avg  Max
minos.slac.stanford.edu  134.79.196.100   www.lbl.gov  128.3.7.14      100    870393602  10   10   6    18   125
99. CITA’15 Workshop, August 2015
• Bytes - can be 100 or 1000 (min 100); number of bytes in each ping packet
• Time - Unix epochal time, in GMT (UTC)
• Xmt - number of ping packets sent
• Rcv - number of ping packets received
• Min - minimum response time for packets sent (in milliseconds)
• Avg - average response time for packets sent (in milliseconds)
• Max - maximum response time for packets sent (in milliseconds)
100. CITA’15 Workshop, August 2015
PingER Monitor Node Format (revised)
• Same as original, plus:
• for each ping response, the sequence number is recorded
• for each ping, the RTT (round-trip time) is recorded
101. CITA’15 Workshop, August 2015
PingER Rules
• There should always be >7 tokens in the line
• If <=7 tokens, the site is considered unreachable
• If no responses to the pings are received, there are only 8 tokens and Rcv (the 8th token) will be 0
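A hedged sketch of a parser that applies those token-count rules to one measurement line; the field order follows the original-format table above, and everything beyond that layout is assumed:

```python
def parse_pinger_line(line):
    """Parse one PingER measurement line, applying the token-count rules."""
    tokens = line.split()
    if len(tokens) <= 7:   # rule: <=7 tokens means the site was unreachable
        return None
    record = {
        "monitor_name": tokens[0], "monitor_addr": tokens[1],
        "remote_name": tokens[2], "remote_addr": tokens[3],
        "bytes": int(tokens[4]), "time": int(tokens[5]),
        "xmt": int(tokens[6]), "rcv": int(tokens[7]),
    }
    if record["rcv"] > 0 and len(tokens) >= 11:  # Min/Avg/Max only if replies arrived
        record["min"], record["avg"], record["max"] = map(int, tokens[8:11])
    return record

line = ("minos.slac.stanford.edu 134.79.196.100 www.lbl.gov 128.3.7.14 "
        "100 870393602 10 10 6 18 125")
print(parse_pinger_line(line))
```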
106. CITA’15 Workshop, August 2015
Possible Uses of PingER Data
• Technical
• Economical
• Troubleshooting
• Collaboration
• Quantifying the impact of events
• Routing
107. CITA’15 Workshop, August 2015
Workshop Exercise
• Given a table of (unstructured) data
• Produce an RDF graph that reflects the content in such a way that the information intent is preserved but the data is now available for RDF operations such as merging with other linked datasets and RDF query (one possible shape is sketched after this list)
• Think of new applications that this “triplification” might add to the use of PingER data, and what parties might be interested
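One possible shape for such a graph, sketched with rdflib together with a SPARQL query over it; the pinger: namespace and property names are invented for illustration, not the project’s actual LOD vocabulary:

```python
from rdflib import Graph, Literal, Namespace

PING = Namespace("http://example.org/pinger/")  # hypothetical vocabulary

g = Graph()
m = PING["measurement/870393602"]  # one measurement, keyed by its timestamp
g.add((m, PING.monitorNode, PING["node/minos.slac.stanford.edu"]))
g.add((m, PING.remoteNode, PING["node/www.lbl.gov"]))
g.add((m, PING.avgRTT, Literal(18)))

# Once triplified, the data supports RDF query (SPARQL) and graph merging
results = g.query("""
    PREFIX pinger: <http://example.org/pinger/>
    SELECT ?m ?rtt WHERE { ?m pinger:avgRTT ?rtt }
""")
for row in results:
    print(row.m, row.rtt)
```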
122. CITA’15 Workshop, August 2015
What Did We Do?
[Diagram: data in various formats is mapped and exposed as data represented in an abstract (RDF) format, which applications can then manipulate, query, etc.]