James Dixon has a unique perspective on the big data space - he coined the term "data lake." In this on-demand webinar the Big Data Maverick talks big data - watch to learn more about the technology landscape and evolving use cases. He covers topics such as:
- What are today's technologies of choice - where did they come from and why?
- Why is the emergence and definition of these use cases so important?
- What technologies are likely to come next?
- Why did the data explosion start and will it continue?
- Why are data scientists in such huge demand?
- What is the role of open source in big data, and the role of big data in open source?
This document contains information about Project Gutenberg, including a summary of its goals to distribute 1 trillion free electronic texts by 2001. It provides statistics on the number of texts produced and readers reached each month. It also includes legal information for users of Project Gutenberg texts, such as disclaimers of warranty and limitations of liability.
Filling the Data Lake - Strata + Hadoop World San Jose 2016 Preview Presentation (Pentaho)
Preview of the Strata + Hadoop World San Jose 2016 session about truly scalable and automated data onboarding for Hadoop
Attend the presentation at the conference to learn how to tackle repeatable, self-service Hadoop ingestion without coding
Filling the Data Lake
Thursday, March 31 11:50a-12:30p
Room 230B
http://conferences.oreilly.com/strata/hadoop-big-data-ca/public/schedule/detail/50677
30 for 30: Quick Start Your Pentaho Evaluation (Pentaho)
These slides are from our recent 30 for 30 webinar, tailored towards people who have downloaded the Pentaho evaluation and want to know more about all the data integration and business analytics components included in the trial, how to easily integrate data, and best practices for installing and developing content.
Explore how data integration (or “mashups”) can maximize analytic value and help business teams create streamlined data pipelines that enable ad-hoc analytic inquiries. You’ll learn why businesses are increasingly focused on blending data on demand and at the source, the concrete analytic advantages that this approach delivers, and the types of architectures required for delivering trusted, blended data. We provide a checklist to assess your data integration needs and capabilities, and review some real-world examples of how blending various data types has created significant analytic value and concrete business impact.
Pentaho is an open source business intelligence suite founded in 2004 that provides reporting, online analytical processing (OLAP) analysis, data integration, dashboards, and data mining capabilities. It can be downloaded for free from pentaho.com or sourceforge.net. Pentaho's commercial open source model eliminates licensing fees and provides annual subscription support and services. Key features include flexible reporting, a report designer, ad hoc reporting, security roles, OLAP analysis, ETL workflows, drag-and-drop data integration, alerts, and data mining algorithms.
Introduction to MIS is the most important class in the business school (amrit47)
This document provides an overview of why the Introduction to MIS (Management Information Systems) course is important for business students. It discusses how technological changes like Moore's Law, Metcalfe's Law, and others are fundamentally changing businesses and driving corporate profitability. It argues that future business professionals need to be able to assess, evaluate, and apply emerging information technologies to business. The document explains how technological changes can disrupt even large, successful companies if they do not adapt. It emphasizes that the only real job security is having marketable skills, like abstract reasoning, systems thinking, collaboration and experimentation, that this course aims to develop in students.
A solution to controlling and dealing with the dark data inspired by biologic... (Ali Alizade Haghighi)
Today, many global companies clearly show a tendency to analyze discarded data and exploit it in order to enlarge their money-making machines. Here we attempt to control this Dark Data and turn it into pre-processed Big Data with new strategies. We try to create the ability to process and classify the Dark Data, inspired by biological phenomena...
This lecture provides an introduction to big data, including its definition, sources, and key characteristics. Big data refers to extremely large data sets that are difficult to process using traditional methods. It comes from everywhere and is growing exponentially due to people, devices, sensors, and organizations constantly generating data. The key characteristics that define big data are volume, velocity, and variety. Volume refers to the extremely large size of data, velocity is the speed at which data is generated and processed, and variety means data comes in all types of formats. Examples are given of how various companies and domains work with big data and the challenges it poses in terms of capturing, storing, analyzing and visualizing such large and diverse datasets in a timely manner.
A detailed description of big data and its characteristics: What are the limitations of traditional systems? Where is big data being used? And what are the applications of big data?
This document discusses big data and Hadoop frameworks for managing large volumes of data. It begins with an overview of how data generation has increased exponentially from employees to users to machines. Next, it discusses the history of big data technologies like Google File System and MapReduce, which were combined to create Hadoop. The document then covers sources of big data, challenges of big data, and how Hadoop provides a solution through distributed processing and its core components like HDFS and MapReduce. Finally, data processing techniques with traditional databases versus Hadoop are compared.
BIG DATA | How to explain it & how to use it for your career? (Tuan Yang)
If you ask people what BIG DATA is they often say it is about a lot of data. But the world has ALWAYS had a lot of data. It is about datafication – a word so new even spellcheck functions don’t know it is a real word!
Learn more about:
» How BIG DATA changes career paths of even the most unsuspecting?
» How BIG DATA changes the way business decisions are made?
» How BIG DATA changes who makes those decisions & the reshuffle of the balance of power it causes?
» What BIG DATA skills can you bring to the office tomorrow to increase your value to the firm?
This document summarizes a presentation about the future of the internet between 2011-2020. The presentation discusses 10 key trends that will shape the internet: e-government 2.0, creating a culture of innovation, cloud computing, the internet of things, big data, transparency, consumerization, people-centered design, community, and the need for updated policies. The presentation argues the internet is less than 20% developed and the next decade will see as much or more change than the last 20 years, including ubiquitous wireless connectivity and over 100 billion internet-connected devices.
This document discusses the key characteristics of Big Data - volume, variety, velocity, and veracity. It provides examples and explanations of each characteristic. Volume refers to the large amount of data. Variety means the different types and sources of data. Velocity is about the speed at which data is processed. Veracity relates to the quality and trustworthiness of the data. The document emphasizes that understanding these characteristics is important for effectively managing and analyzing Big Data.
This document provides a 3-5 year projection for technology trends in enterprise IT (EIT) based on analysis from experts and current market conditions. Key points include:
- EIT is currently a $2.1 trillion global market dominated by software, devices, and outsourcing.
- Cloud computing and software-as-a-service (SaaS) are rising significantly and most experts predict SaaS will capture the largest share of the business market.
- By 2020, the boundaries between on-premise and cloud deployment may disappear, and technologies like artificial intelligence, autonomous systems, and predictive analytics will be more widely adopted. Data management is also expected to converge across structured and unstructured data.
Big Data Story - From An Engineer's Perspective (Hien Luu)
The document provides an overview of big data from an engineer's perspective. It discusses how (1) the amount of data created daily is exponentially growing, with 90% created in the last two years, (2) data is transforming how we live and work through opportunities in areas like social networking, ecommerce, and smart technologies, and (3) big data is fueling innovation through capabilities like prediction, recommendation, and detection using algorithms.
How is digital preservation typically presented to attract the interest of the mainstream audience? We select some recurring themes from a recent article and try to interpret the meaning for non-specialists.
Slides to prompt discussion at the first KeepIt project meeting, 2 June 2009, by Steve Hitchcock, KeepIt project manager
Data centers are large, secure buildings that house servers and networking equipment to store and process data. They provide critical infrastructure to support cloud computing and store the vast amounts of data generated every day by individuals, companies, and apps. Data centers use advanced cooling and power systems to operate banks of computer servers efficiently and are located strategically worldwide for reliability and proximity to users. They provide high-tech jobs in computer systems, engineering, security, and business management.
This document is a briefing of the Exponential Manufacturing conference organized by Singularity University in May 2016. We enriched it with examples and articles of our own.
What's the Big Deal About Big Data?.pdf (Steven Jong)
If a CPU is a computer, then there are now hundreds of computers for each person on earth, all of them collecting raw data, some at prodigious rates. The total amount of data in the world is doubling every two years, and the uses to which that data can be put are amazing. As a market segment, Big Data is growing faster than the software industry. Documenting Big Data is thus a lucrative and interesting field for technical communicators.
What is Big Data and how is it fundamentally different from storing files? In what fields is Big Data being used today? What are some examples of how the data can be analyzed and used? Are there any ethical issues involved? What skills and knowledge do you need to know to break into the field? The presenter, who has worked for several years for a storage vendor, offers his experience and perspective.
The document discusses how the growing Internet of Things (IoT) and increase in data collection will impact businesses. It notes that while IoT and big data are not revolutionary on their own, together they will require changes in how data is managed and analyzed. Specifically, it argues that to succeed with the rise of IoT, systems must be optimized to reduce data transfers, data must be fragmented into smaller transactions instead of bulk transfers, and data must be made widely accessible through open platforms and tools. The document provides examples of how companies like Netflix, Facebook, and others have optimized data handling and argues this approach will be needed as IoT devices proliferate into the billions.
The document discusses how the growing Internet of Things (IoT) and increase in data collection will impact businesses. It notes that while IoT and big data are not revolutionary on their own, together they will require changes in how data is managed and analyzed. Specifically, it argues that to succeed with the rise of IoT, systems must be optimized to reduce data transactions, data must be fragmented into smaller pieces to ease analysis, and data must be made widely accessible through open platforms and tools. The document cautions that failing to properly manage the growing amounts of connected devices and data could lead to security risks and negatively impact businesses.
Big data PPT prepared by Hritika Raj (Shivalik college of engg.)
This document provides an overview of big data, including its definition, characteristics, sources, tools used, applications, risks and benefits. Big data is characterized by volume, velocity and variety of structured and unstructured data that is growing exponentially. It is generated from sources like mobile devices, sensors, social media and more. Tools like Hadoop, MapReduce and data analytics are used to extract value from big data. Potential applications include healthcare, security, manufacturing and more. Risks include privacy and scale, while benefits include improved decision making and new business opportunities. The big data industry is rapidly growing and transforming IT and business.
1. The document discusses big data problems faced by various domains like science, government, and private organizations.
2. It defines big data based on the 3Vs - volume, velocity, and variety. Volume alone is not sufficient, and these factors must be considered together.
3. Traditional databases are not suitable for big data problems due to issues with scalability, structure of data, and hardware limitations. Distributed file systems like Hadoop are better solutions as they can handle large and varied datasets across multiple nodes.
The document provides an introduction to the AP Computer Science Principles Explore Performance Task (PT). It outlines the two main components of the Explore PT - a computational artifact and written responses. It highlights two prompts that will require students to discuss the beneficial and harmful effects of computing innovations and how they use data, themes of the current unit. The document explains that students will practice the skills needed for the Explore PT throughout the unit before completing it at the end.
This document provides an introduction to a training course on big data analytics. It discusses why big data has become important due to the exponential growth in data volume, velocity, and variety. The course aims to focus on cloud-based storage and processing of big data using systems like HDFS, MapReduce, HBase and Storm. It emphasizes that learning involves actively asking questions. Big data is introduced by explaining the three V's of volume, velocity and variety. Examples of big data usage are given in areas like baseball analytics, political campaigns and election predictions. Challenges of big data integration and processing large volumes of heterogeneous data are also covered.
Check out this presentation from Pentaho and ESRG to learn why product managers should understand Big Data and hear about real-life products that have been elevated with these innovative technologies.
Learn more in the brief that inspired the presentation, Product Innovation with Big Data: http://www.pentaho.com/resources/whitepaper/product-innovation-big-data
What's in store for Big Data in 2015? Will the 'Internet of Things' fuel the Industrial Internet? Will Big Data get Cloudy? Check out the top five Big Data predictions for 2015 according to Quentin Gallivan, CEO, Pentaho
With the combination of Pentaho and MongoDB, it’s drastically simpler and faster to build single analytical views of clients by aggregating and blending data from a variety of internal sources (customer, transaction, position data) and external sources (social networking, central bank, news, pricing) with fast response times.
Webinar covers:
An insider’s view of new ways financial services companies are using MongoDB to rapidly store and consume unlimited shapes and sizes of data
How Pentaho makes it easy to enrich data in MongoDB with predictive scoring, visual data integration tools, reports, interactive dashboards, and data visualizations
A live demo of blending Twitter, equity pricing, and news data into a single analytical view that unlocks market intelligence to create investment opportunities
Why Your Product Needs an Analytic Strategy (Pentaho)
The document discusses strategies for enhancing products with analytics capabilities. It outlines three strategic approaches: 1) enhance current software products with analytics, 2) target new opportunities using existing data through direct data monetization or new products/services, and 3) reinvent value propositions using new data technologies like big data. The document provides examples of implementing analytics capabilities for different user personas and considerations for analytics deployments. It argues that analytics can provide benefits like improved decisions, customer stickiness, and new revenue opportunities.
Users and customers don't just want products and services anymore - they also want the data and analytics that are under the hood! The good news is that delivering value with data is more achievable than ever before thanks to greater access to diverse data sources and the ability to process, blend, and refine data at unprecedented scale.
Improving the Business of Healthcare through Better Analytics (Pentaho)
The document discusses challenges in the US healthcare system and how big data analytics can help address these challenges. It then provides details on MedeAnalytics, a company that provides big data analytics solutions for healthcare. MedeAnalytics collects over 21 billion records from 10,000+ data feeds, manages over 300 terabytes of data, and serves over 900 hospital and payer clients. The document outlines MedeAnalytics' suite of solutions and how they help improve financial, operational, and clinical outcomes for healthcare providers and health plans.
Up Your Analytics Game with Pentaho and Vertica (Pentaho)
Big Data is a game-changer.
In the face of exploding volumes and varieties of data, traditional data management and ETL systems just aren’t cutting it anymore. What is desperately needed is a new way of sifting through vast volumes of data to find the most relevant information, and of combining this data with other data sources to extract insights faster. Enter HP|Vertica and Pentaho with a proven solution for lightning-fast queries and blended data and analytics capabilities for your business users.
Pentaho Analytics for MongoDB - presentation from MongoDB World 2014 (Pentaho)
Bo Borland presentation at MongoDB World in NYC, June 24, 2014. Data Integration and Advanced Analytics for MongoDB: Blend, Enrich and Analyze Disparate Data in a Single MongoDB View
This webinar discusses leveraging embedded analytics for customer success. It covers:
1) The challenges of gaining visibility into customer data and implementing new capabilities for customer success products.
2) How embedded analytics can help ISVs quickly add new customer success capabilities to increase customer adoption and consumption by supporting the customer lifecycle.
3) Examples of how embedded analytics could help with communities, customer service, and professional services automation for customer success.
Exclusive Verizon Employee Webinar: Getting More From Your CDR Data (Pentaho)
This document discusses a project between Pentaho and Verizon to leverage big data analytics. Verizon generates vast amounts of call detail record (CDR) data from mobile networks that is currently stored in a data warehouse for 2 years and then archived to tape. Pentaho's platform will help optimize the data warehouse by using Hadoop to store all CDR data history. This will free up data warehouse capacity for high value data and allow analysis of the full 10 years of CDR data. Pentaho tools will ingest raw CDR data into Hadoop, execute MapReduce jobs to enrich the data, load results into Hive, and enable analyzing the data to understand calling patterns by geography over time.
Predictive Analytics with Pentaho Data Mining - Análisis Predictivo con Penta... (Pentaho)
This webinar is in Spanish -
The use of predictive analytics, or data mining, is booming. Worldwide, more and more companies are contracting specialized data-analysis services to help set them apart from the competition. At the same time, the growing volume of data, along with its changing and complex nature, is making traditional analysis unmanageable, and it is becoming necessary to bring in cutting-edge technology and consulting based on advanced mathematical models. Pentaho Corporation and Matrix CPM Solutions invite you to the online seminar “Análisis Predictivo con Pentaho Data Mining” (Predictive Analytics with Pentaho Data Mining), where the great opportunities for its use and application will be reviewed.
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ... (Pentaho)
This document discusses approaches to implementing Hadoop, NoSQL, and analytical databases. It describes:
1) The current landscape of big data databases including Hadoop, NoSQL, and analytical databases that are often used together but come from different vendors with different interfaces.
2) Common uses of transactional databases, Hadoop, NoSQL databases, and analytical databases.
3) The complexity of current implementation approaches that involve multiple coding steps across various tools.
4) How Pentaho provides a unified platform and visual tools to reduce the time and effort needed for implementation by eliminating disjointed steps and enabling non-coders to develop workflows and analytics for big data.
Big Data Integration Webinar: Getting Started With Hadoop Big Data (Pentaho)
This document discusses getting started with big data analytics using Hadoop and Pentaho. It provides an overview of installing and configuring Hadoop and Pentaho on a single machine or cluster. Dell's Crowbar tool is presented as a way to quickly deploy Hadoop clusters on Dell hardware in about two hours. The document also covers best practices like leveraging different technologies, starting with small datasets, and not overloading networks. A demo is given and contact information provided.
The presentation discusses Pentaho Healthcare Solutions and how Pentaho business analytics can help address key issues in the healthcare industry. It highlights 7 BI trends in healthcare including consolidating information, leveraging new data resources, needing self-service data discovery tools, ease of use for non-technical users, users being mobile, professionalization through metrics and KPIs, and performing big data analytics on large varied datasets. It then provides examples of how Pentaho analytics can help with clinical excellence, improving patient satisfaction, compliance, and financial management. The presentation concludes by showcasing two customer use cases where Pentaho helped healthcare organizations and retailers gain insights and cost savings.
Pentaho Business Analytics for ISVs and SaaS providers in healthcare (Pentaho)
The document discusses how business analytics capabilities are important for healthcare ISVs and SaaS providers to compete in the industry. It recommends that ISVs evaluate embedded business intelligence platforms to lower costs of goods sold over five years, improve customer adoption and satisfaction, and deliver more compelling products. The Pentaho OEM program is presented as an option for ISVs to gain world-class analytic capabilities for their offerings in a cost-effective manner through flexible business terms and a technology partner experienced in the healthcare sector.
1. The document discusses Pentaho's approach to big data analytics using a component-based data integration and visualization platform.
2. The platform allows business analysts and data scientists to prepare and analyze big data without advanced technical skills.
3. It provides a visual interface for building reusable data pipelines that can be run locally or deployed to Hadoop for analytics on large datasets.
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence (IndexBug)
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Programming Foundation Models with DSPy - Meetup Slides (Zilliz)
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Taking AI to the Next Level in Manufacturing.pdf (ssuserfac0301)
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Things to Consider When Choosing a Website Developer for your Website | FODUU
Choosing the right website developer is crucial for your business. This article covers essential factors to consider, including experience, portfolio, technical skills, communication, pricing, reputation & reviews, cost and budget considerations and post-launch support. Make an informed decision to ensure your website meets your business goals.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf (Techgropse Pvt. Ltd.)
In this blog post, we'll delve into the intersection of AI and app development in Saudi Arabia, focusing on the food delivery sector. We'll explore how AI is revolutionizing the way Saudi consumers order food, how restaurants manage their operations, and how delivery partners navigate the bustling streets of cities like Riyadh, Jeddah, and Dammam. Through real-world case studies, we'll showcase how leading Saudi food delivery apps are leveraging AI to redefine convenience, personalization, and efficiency.
Generating privacy-protected synthetic data using Secludy and Milvus (Zilliz)
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides at Nordic Testing Days 6.6.2024
The climate impact and sustainability of software testing are discussed in the talk. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Best 20 SEO Techniques To Improve Website Visibility In SERP (Pixlogix Infotech)
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
HCL Notes and Domino license cost reduction in the world of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX models have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefits it brings you. Above all, you certainly want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some practices that can lead to unnecessary expenses, for example when a person document is used for shared mailboxes instead of a mail-in. We show you such cases and their solutions. And of course we explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It gives you the tools and the know-how to keep track of everything. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future as well.
Topics covered:
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Real-world examples and best practices to apply immediately
Building Production Ready Search Pipelines with Spark and Milvus (Zilliz)
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
UiPath Test Automation using UiPath Test Suite series, part 6 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Full-RAG: A modern architecture for hyper-personalization (Zilliz)
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
The Next Big Thing in Big Data
1.
2.
3. Because I have a lot to cover there won’t be time for questions at the end. And I’m guessing some of the questions won’t have simple answers. So you can go to my blog jamesdixon.wordpress.com where each of the sections is a separate post that you can comment on or ask questions about.
4. First let’s look at the data explosion that everyone is talking about at the moment.
5. This is a quote from a paper about the importance of data compression because of data explosion. It seems reasonable. Store information as efficiently as possible so that the effects of the explosion are manageable.
[TRANSITION]
This was written in 1969.
So the data explosion is not a new phenomenon. It has been going on since the mid-60s.
http://www.sciencedaily.com/releases/2013/05/130522085217.htm
6. This is another quote, much more recent, that you might see online. This says that the amount of data being created and stored is multiplying by a factor of 10 every two years. I have not found any numerical data to back this up, so I will drill into this in a few minutes.
7. So consider this graph of data quantities. It looks like it might qualify as a data explosion. But this is actually just the underlying trend of data growth with no explosion happening.
This graph just shows hard drive sizes for home computers, starting with the first PCs with 10MB drives in 1983 and going up to a 512GB drive today.
Some of you might recognize that this exponential growth is the storage equivalent of Moore’s law, which states that computing power doubles every 2 years. And we can see from these charts that hard drives have followed along at the same rate.
This exponential growth in storage combines with a second principle.
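As a quick sanity check on that rate, here is a minimal sketch (assuming 1983 for the 10MB drive and 2014 for the 512GB drive; the talk does not pin down the end year) that computes the implied doubling time:

```python
import math

# Assumed data points from the talk: 10 MB in 1983, 512 GB "today".
start_year, start_mb = 1983, 10
end_year, end_mb = 2014, 512 * 1024  # 2014 is an assumed end year

doublings = math.log2(end_mb / start_mb)                  # ~15.7 doublings
years_per_doubling = (end_year - start_year) / doublings

print(f"{doublings:.1f} doublings, one every {years_per_doubling:.1f} years")
# -> roughly one doubling every 2 years, the same cadence as Moore's law
```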
8. http://www.sciencedaily.com/releases/2013/05/130522085217.htm
This statement is not just an ironic observation.
This effect is due to the fact that the amount of data stored is also affected by Moore’s law. With twice the computing power, you can process images that are twice as big. You can run applications with twice the logic. You can watch movies with twice the resolution. You can play games that are twice as detailed. All of these things require twice the space to store them.
Today an HD movie can be 3 or 4 gigabytes. In 2001 that was your entire hard drive.
9. With processing power doubling at the same rate that storage is increasing, what does this say about any gap between the data explosion and the CPU power required to process it?
12. If we divide the amount of data by the amount of processing power we get a constant. We get a straight line. If this holds true then we will never drown in our own data.
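That claim can be written out as a one-line derivation (a sketch, assuming data volume D and processing power P both double every two years from starting values D_0 and P_0):

```latex
D(t) = D_0 \cdot 2^{t/2}, \qquad P(t) = P_0 \cdot 2^{t/2}
\quad\Longrightarrow\quad
\frac{D(t)}{P(t)} = \frac{D_0}{P_0} = \text{constant.}
```

The exponential terms cancel, which is why the ratio plots as a flat line.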
13. Can we really call it an explosion, if it is just a natural trend? We don’t talk about the explosion of processing power – it’s just Moore’s law. Is there a new explosion that is over and above the underlying trend? If so, how big is it and will it continue? We are going to find the answers to all of these questions. Before we do there are some things to understand.
14. Firstly there is a point, for any one kind of data, where the explosion stops or slows down. It is the point at which the data reaches its natural maximum granularity, and beyond which there is little practical value to increasing the granularity. I’m going to demonstrate this natural maximum using some well known data types.
15. Let’s start with color. Back in the early 80s we went from black and white computers to 16 color computers. The 16 color palette was a milestone because each color needed to have a name, and most computer programmers at the time couldn’t name that many colors. So we had to learn teal, and fuchsia, and cyan and magenta.
Then 256 colors arrived a few years later. Which was great because it was too many colors to name, so we didn’t have to.
Then 4,000 colors. And within the decade we were up to 24-bit color with 16 million colors. Since then color growth has slowed down. 30-bit color came 10 years later, followed by 48-bit color a decade ago, with its 280 trillion colors. But in reality most image and video editing software, and most images and videos, still use 24-bit color.
16. Because we see a similar thing with video resolutions. They have increased, but not exponentially. The current standard is 4K, which has 4 times the resolution of 1080p. With 20/20 vision 4K images exceed the power of the human eyeball when you view them from a typical distance. The “retina” displays on Apple products are called that because they have a resolution designed to match the power of human vision. So images and video are just reaching their natural maximum but these files will continue to grow in size as we gradually reduce the compression and increase the quality of the content.
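The numbers on these two slides are simple powers of two; here is a minimal sketch that reproduces them (using the standard 3840x2160 and 1920x1080 pixel dimensions for 4K and 1080p):

```python
# Palette size grows as 2^bits, so color depth grew exponentially for a while.
for bits in (4, 8, 24, 30, 48):
    print(f"{bits}-bit color: {2**bits:,} colors")
# 24-bit -> 16,777,216 (~16 million); 48-bit -> 281,474,976,710,656 (~280 trillion)

# Resolution, by contrast, has not grown exponentially: 4K is only 4x 1080p.
print((3840 * 2160) / (1920 * 1080))  # -> 4.0
```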
17. In terms of ability, the human hearing system lies in between 16-bit sound and 24-bit sound. So again we have hit the natural limit of this data type.
If you still don’t believe in the natural granularity, I have one further example.
18. Dates. In the 60s and 70s we stored dates in COBOL as 6 digits. This gave
rise to the Y2K issue.
We managed to avoid that apocalypse. With 32-bit dates we extended the date
range by 38 years. But since the creation of 64-bit systems and 64-bit dates,
when is the next crisis for dates? Everyone should have this in their diary. It’s a
Sunday afternoon, December 4th. But what year? Anyone? It’s the year 292
billion and change.
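The arithmetic behind that punchline, as a small sketch (assuming a signed 64-bit count of seconds since 1970, the common convention):

    # How far does a signed 64-bit seconds-since-1970 counter reach?
    # The famous overflow date is December 4 of the year 292,277,026,596.
    seconds = 2 ** 63 - 1                   # largest signed 64-bit value
    years = seconds / (365.25 * 24 * 3600)  # convert seconds to years
    print(f"{years:,.0f} years")            # about 292 billion years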
19. So this is the graph showing the natural granularity of dates for the next 290
billion years.
[TRANSITION]
For reference, the green line shows the current age of the universe, which is 14
billion years.
20. So now that we understand that different data types have a natural maximum
granularity, how does it relate to big data and the data explosion?
21. Look at the example of a utility company that used to record your power
consumption once a month and now does it every 10 seconds. Your
household appliances – the dishwasher, fridge, oven, heating and air
conditioning, TVs, computers – don’t turn on and off that often. The microwave
has the shortest duty cycle, and even then 20 seconds is about the shortest
time it runs for.
[TRANSITION]
So this seems like a reasonable natural maximum
22. Now let’s take a cardiologist who, instead of seeing a patient once a month to
record some data, can now get data recorded once a second, 24 hours a day.
Your heart rate, core temperature, and blood pressure don’t change on a sub-
second interval.
[TRANSITION]
So again this seems like a reasonable natural maximum
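The growth factors in these two examples are easy to compute (a sketch, assuming a 30-day month):

    # Data growth when sampling rates jump from monthly to sub-minute.
    readings_10s = 30 * 24 * 360   # one reading every 10 s for a month
    readings_1s = 30 * 24 * 3600   # one reading every second for a month
    print(readings_10s)  # 259,200x more meter readings than one per month
    print(readings_1s)   # 2,592,000x more patient readings than one per month

A jump of five or six orders of magnitude, and then the curve flattens at the new natural maximum.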
23.
24. As companies create a production big data system, the amount of data stored
will increase dramatically until they have enough history stored – anywhere
from 2 to 10 years of data. Then the growth will slow again. So the amount of
data will explode, or pop, over a period of a few years.
25. If this is your data before it pops
[TRANSITION]
Then this is your data after it pops
26.
27.
28. There are millions of companies in the world. If you only talk to the top 1,000
companies in the USA, you get a very small view of the whole picture.
29. This brings us back to this claim, which aligns with the hype. How can we really
assess the growth in data?
30. My thought is that if the data explosion is really going at a rate of 10x every two
years, then HP, Dell, Cisco, and IBM must be doing really well, as these
manufacturers account for 97% of the blade server market in North America.
And Seagate, SanDisk, Fujitsu, and Hitachi must be doing really well too, as
they make the storage. And Intel and AMD must be doing really well because
they make the processors.
Let’s look at HP, which has 43% of the worldwide blade server market.
31.
32.
33. From graphs of stock prices we can see that IBM, Cisco, Intel, EMC, and HP
don’t have growth rates that substantiate a data explosion.
34. When we look at memory and drive manufacturers, the best performers are
Seagate and Micron, with about 200-300% growth over 5 years. That is a
multiplier of about 1.7 year over year.
35. If we apply that multiplier of 1.7 to the underlying data growth trend, we see that
the effect is noticeable but not really that significant. And that represents the
maximum growth of any vendor, so the actual growth will be less than this.
36.
37.
38. When we look at the computing industry from a high level we see a shift in
value: from hardware in the 60s and 70s, with IBM as the king; to software,
with Microsoft; then to tech products and solutions from companies like Google
and Apple; and finally to products that are based purely on data, like Facebook
and LinkedIn.
Over the same time periods we have seen statistics [TRANSITION] be
augmented with machine learning [TRANSITION] and more recently with deep
learning [TRANSITION]
The emergence of deep learning is interesting because it provides
unsupervised or semi-supervised data manipulation for creating predictive
models.
39.
40. It’s like the difference between mining for gold when you can just hammer
lumps of it out of the ground, and panning for tiny gold flakes in a huge pile of
sand and stones
41. The number of data scientists is not increasing at the same rate as the amount
of data and the number of data analysis projects. We are not doubling the
number of data scientists every two years. This is why deep learning is a big
topic at the moment: it automates part of the data science process.
The problem is that the tools and techniques are very complicated.
42. For an example of complexity, here is a classical problem known as the German
Tank Problem.
43. In the Second World War, leading up to the D-Day invasion, the Allies wanted to
know how many tanks the Germans were making.
44. So statistics was applied to the serial numbers found on captured and
destroyed tanks.
47. The results were very accurate.
[TRANSITION]
When intelligence reports estimated that 1,400 tanks were being produced per
month,
[TRANSITION]
the statistics estimated 273.
[TRANSITION]
The actual figure was later found to be 274.
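The estimator behind this result is short enough to show. This is the well-known frequentist estimate: the sample maximum plus the average gap between observed serial numbers. The serials below are made up purely for illustration:

    # German Tank Problem: estimate total production N from k observed
    # serial numbers with sample maximum m, using N ~ m + m/k - 1.
    def estimate_total(serials):
        m, k = max(serials), len(serials)
        return m + m / k - 1

    print(round(estimate_total([19, 40, 42, 60])))  # hypothetical serials -> 74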
48. This next example is one of the greatest early works in the field of operations
research. It is interesting for several reasons. Firstly because, with the
creation of Storm and Spark Streaming and other real-time technologies, we
are seeing a dramatic increase in the number of real-time systems that include
advanced analytics, machine learning, and model scoring. But this field is not
new. The other reason this is interesting is that it shows that correctly
interpreting the analysis is not always obvious, and is more important than
crunching the data.
49. In an effort to make bombers more effective each plane returning from a
mission was examined for bullet holes and a map showing the density of bullet
holes over the planes was generated.
“Tally ho, chaps,” said the bomber command commanders, “slap some lovely
armor on these beauties wherever you see the bullet holes.”
[TRANSITION]
“Hold on a minute,” said one bloke. “I do not believe you want to do that.”
“Well, who are you?” said the bomber command commanders.
“My name is Abraham Wald. I am a Hungarian statistician who sounds a lot like
Michael Caine for some reason.”
Wald’s reasoning was that they should put the armor where there were no bullet
holes, because that’s where the planes that didn’t make it back must have been
getting hit – places like the cockpit and the engines.
50. I deliberately chose two examples from 70 years ago to show that the problems
of analysis and interpretation are not new, and they are not easy. In 70 years
we have managed to make tools more capable but not much easier. But this
has to change.
51. So these are my conclusions on data science. We have more and more data,
but not enough human power to handle it, so something has to change.
54. And R now has as much interest as SAS and SPSS combined.
55. Until recently there was more interest in MapReduce than Spark, and so
today we see mainly MapReduce in production. But as we can see from the
chart, this is likely to change soon.
56. The job market shows us similar data, with the core Hadoop technologies
currently providing more than 3/4 of the job opportunities.
57. And also that Java, the language of choice for big data technologies, has the
largest slice of the open job positions.
58. One issue that is not really solved well today is SQL on Big Data.
59. In the job market, HBase is the most sought-after skill set. But you can see
that Phoenix, which is the SQL interface for HBase, is not represented in terms
of jobs. This chart also shows that the many proprietary big data SQL solutions
are not sought-after skills at the moment. We don’t have a good solution for
SQL on big data yet.
60. Today aspects of an application that relate to the value of the data are typically
a version 2 afterthought for application developers.
[TRANSITION]
This affects the design of both applications, and data analysis projects.
61. For a software application, the value of the data is not factored in, the natural
granularity is not considered, and the data analysis is not part of the
architecture. So we see architectures like this, with a database, business
logic, and a web user interface.
62. The data analysis has to be built as a separate system, which is created and
integrated after the fact.
At a high level, it will be something like this for a big data project, given the
charts and trends we saw earlier. We commonly see Hadoop, MapReduce,
HBase, and R.
63. So here are the summary points for today’s technology stack.
65. If data is more valuable than software,
[TRANSITION]
we should design the software to maximize the value of the data,
and not the other way around.
66. We should design applications with the purpose of storing, and getting value
from, data at its natural maximum granularity.
67. We should provide access to the granular data for the purpose of analysis and
operational support
68. If data is where the value is, then the use and treatment of the data should be
factored into an application from the start.
[TRANSITION]
It should be a priority of version 1.
69. [TRANSITION]
Valuing the data more than the software is a new requirement.
[TRANSITION]
Which demands new applications
[TRANSITION]
Which need new architectures
70. To illustrate this, let’s take the example of Blockbuster as an old architecture.
Hollywood studios would create content that was packaged and loaded in
batches to Blockbuster stores. The consumer would then jump in their car
every time little Debbie wanted to watch The Little Mermaid. Notice that in this
architecture there are a number of physical barriers between the parts.
71. Today we can still watch a Hollywood movie on our TV just as the Blockbuster
model enabled. But we have more sources of content, and we have more
devices we can use. And the architecture for this is very different in the middle
layers.
72. Consider implementing YouTube using the Blockbuster architecture. You take a
cat video on your camera or phone.
[TRANSITION]
Then you spend months burning 10,000 DVDs.
[TRANSITION]
Next you go to FedEx and spend $25,000 to get your DVDs to the Blockbuster
stores.
[TRANSITION]
Each Blockbuster store will need 950 miles of shelving to store 120 million
videos, and will have to add shelving at the rate of 1 foot per second to handle
the incoming videos.
As you can see, it is economically unviable and physically impossible to
implement YouTube using the old architecture.
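A quick check on the shelving figure (a sketch assuming roughly half an inch of shelf per DVD case, which is my assumption rather than a number from the slide):

    # Shelf length needed for 120 million DVDs at ~0.5 inch per case.
    dvds = 120_000_000
    miles = dvds * 0.5 / 12 / 5280  # inches -> feet -> miles
    print(f"{miles:,.0f} miles")    # about 950 miles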
73. Consider an Internet of Things architecture where you have millions of devices
that have their own state and communicate with each other and with the central
system in real time. These systems need to perform complex event processing
and analysis of state, and use predictive and prescriptive analytics to avoid
failures. You cannot bolt all of this analysis onto the outside of the system as an
afterthought; it has to be designed in and it has to be embedded.
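As a concrete illustration of designed-in rather than bolted-on, here is a minimal sketch where the analytics lives inside the event handler itself; every name here (handle_event, the window size, the threshold) is hypothetical:

    # Minimal sketch: score each device reading in-line, inside the
    # application, instead of shipping it off to a separate analytics system.
    from collections import deque

    window = deque(maxlen=1000)  # recent readings for one device

    def handle_event(reading, threshold=3.0):
        # Flag readings more than `threshold` standard deviations from the
        # recent mean before the normal business logic runs.
        if len(window) > 30:
            mean = sum(window) / len(window)
            std = (sum((x - mean) ** 2 for x in window) / len(window)) ** 0.5
            if std and abs(reading - mean) / std > threshold:
                print(f"possible failure: anomalous reading {reading}")
        window.append(reading)

    for r in [20.1, 20.3, 19.9, 20.0] * 10 + [35.0]:
        handle_event(r)  # the final reading triggers the alert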
74. SQL on Big Data is a separate topic for the future. This problem will get solved;
it is only a matter of time before we have a robust and full-featured scale-out
relational database for Big Data.
When this happens it will have a negative effect on the current database
vendors.
It will also affect the niche big data vendors whose main advantage is query
performance.
But it will help both the traditional analytic vendors and the open source
ecosystem.
75. Overall I think scalable technology is more interesting and more powerful than
“big” technology that cannot scale down. Scalable technology allows you to
start small and to grow without having to rewrite, redesign, or retool along
the way.
76. So this is my prediction of future software architectures that combine the
application and the analysis into one. By recognizing the value of the data we
build the big data technologies into the stack. We don’t have them as a
separate architecture that is built afterwards.
77. So here are the summary points for tomorrow’s technology stack.
78. Let’s look at the big data use cases and consider why they will change in the
future
79. In the database world, if you took 50 database administrators and described a
data problem to them, they would probably come up with a small number of
architectures and schemas between them – probably only 3 or 4 once you
discount the minor differences. This happens because, as a community, we
understand how to solve problems using this kind of technology. We have been
doing it for a while, there are lots of examples and teachings and papers, and
we have come to a consensus.
In the big data world we have not got to that point yet. Today if you took 50 big
data engineers and described a problem you would get a large number of
potential solutions back. Maybe as many as 50. We don’t collectively have
enough experience of trying similar things with different architectures and
technologies. But the emergence of these use cases helps a lot, because now
we can categorize different solutions together, even though the actual problem
might be different. Once we can do that we can compare the solutions and get
a better understanding of what works well and what the best practices should
be.
80. There is a set of problems that can be solved by SQL on Big Data as we talked
about earlier
81. There is a second set of problems that can be solved using a data lake
approach. These include agile analytics, and rewinding an application or device
state and replaying events.
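Rewind-and-replay deserves a tiny sketch, because it is the pattern that only works if you kept the raw, granular events in the first place; all names here are hypothetical:

    # Rebuild a device's state at any past moment by replaying its raw events.
    def rewind_state(events, device_id, as_of):
        state = {}
        for ts, dev, change in sorted(events):  # events kept at full granularity
            if dev == device_id and ts <= as_of:
                state.update(change)            # apply each change in order
        return state

    events = [
        (1, "thermostat-1", {"setpoint": 20}),
        (5, "thermostat-1", {"setpoint": 22}),
        (9, "thermostat-1", {"mode": "cool"}),
    ]
    print(rewind_state(events, "thermostat-1", as_of=6))  # {'setpoint': 22}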
82. A third set of solutions exists around processing streams of data in real time.
83. Obviously if we really value data, we should value big data as well.
84. And if we value big data, then it should be built into the system from the start.
So the off-to-the-side big data projects should not exist, with the exception of a
data warehouse solution.
85. Some of the big data use cases that are emerging today only exist because we
are not building big data into the applications. Once we do that we will see
some of the big data use cases change.
86. So to conclude this section, while big data use cases are important, the
architecture stack needs to change, and with big data built in, we will see some
of the big data use cases changing in the future.
87.
88. Big data would not exist without open source. These big data projects were
created because of a lack of existing products that were scalable and
cost-effective.
89. Many of these big data projects were created and donated by fourth-generation
software companies – the companies that value data most highly.
91. Usage of Big Data by large enterprises fuels the news and excites the market
analysts and commentators, because this is who they talk to the most. The fact
that open source is not the main story is good because it makes acceptance of
open source an assumption, and not a point of discussion or contention.
93. Which benefits the open source ecosystem. So we have a feedback loop
where open source and the big data technologies both benefit.
94. According to the most recent Black Duck “Future of Open Source” survey, open
source is now used to build products and service offerings at 78% of
companies.
All of these statistics are trending in favor of open source adoption.
95. The Apache Foundation, the organization that stewards Hadoop, Hive, HBase,
Cassandra, Spark, and Storm, currently has over 160 projects. These are just
the big data projects.
96. This is the full list, which includes the Apache HTTP Server with its 60% share
of the web server market. Hadoop and Spark and Cassandra are
hugely popular in the Big Data space. We can expect more of these
technologies to become the standard or the default solution in their space.
97. Spark – in-memory general purpose large-scale processing engine
Kafka – cluster-based central data backbone
Samza – stream processing system
Mesos – cluster abstraction
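To give a feel for why Spark in particular has caught on, here is the canonical word count in Spark's Python API (a sketch; the file name logs.txt is a placeholder):

    # Classic word count on Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()
    lines = spark.read.text("logs.txt").rdd.map(lambda row: row[0])
    counts = (lines.flatMap(lambda line: line.split())   # split lines into words
                   .map(lambda word: (word, 1))          # pair each word with 1
                   .reduceByKey(lambda a, b: a + b))     # sum counts per word
    for word, count in counts.take(10):
        print(word, count)
    spark.stop()

The same few lines run unchanged on a laptop or on a thousand-node cluster, which is exactly the scalable-not-just-big property argued for earlier.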
98.
99. I hope you enjoyed this talk. Whether you agree or disagree with these ideas,
or if you have questions, you can comment on my blog at any time. Thank you all
for joining.