This is an excerpt of the "Tier-1 BI in the World of Big Data" presentation by Thomas Kejser, Denny Lee, and Kenneth Lieu, focusing on the Yahoo! TAO case study published at: http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=710000001707
Yahoo! TAO Case Study Excerpt
1. Yahoo! TAO Case Study: Excerpt of the "Tier-1 BI in the World of Big Data" presentation at the PASS 2011 Conference, by Thomas Kejser, Denny Lee, and Kenneth Lieu
2. Yahoo! TAO Business Challenge
Yahoo! manages a powerful, scalable advertising exchange that includes publishers and advertisers.
3. Yahoo! TAO Business Challenge
Advertisers want to get the best bang for their buck by reaching their targeted audiences effectively and efficiently.
4. Yahoo! TAO Business Challenge
Yahoo! needs visibility into how consumers are responding to ads along many dimensions (web sites, creatives, time of day, gender, age, location) to make the exchange work as efficiently and effectively as possible.
5. Yahoo! TAO Technical Requirements
Visitors to Yahoo!-branded sites: 680,000,000
Ad impressions: 3,500,000,000 per day
Rows loaded: 464,000,000,000 per quarter
Refresh frequency: hourly
Average query time: <10 seconds
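To put the refresh requirement in perspective, here is a back-of-envelope calculation (a sketch: the row count and refresh cadence come from the slide above; the 90-day quarter is an assumption):

```python
# Back-of-envelope load rates implied by the slide's figures.
# The per-quarter row count and hourly refresh come from the slide;
# the ~90-day quarter is an assumption for illustration.

rows_per_quarter = 464_000_000_000   # rows loaded per quarter (from slide)
days_per_quarter = 90                # assumption: ~90-day quarter
refreshes_per_day = 24               # hourly refresh (from slide)

rows_per_day = rows_per_quarter / days_per_quarter
rows_per_refresh = rows_per_day / refreshes_per_day

print(f"{rows_per_day:,.0f} rows/day")          # ~5.2 billion rows/day
print(f"{rows_per_refresh:,.0f} rows/refresh")  # ~215 million rows/hour
```

In other words, an hourly refresh at this scale implies sustaining a load rate on the order of a couple hundred million rows per hour.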
6. Yahoo! TAO Platform Architecture
How did we load so much so quickly?
[Architecture diagram] The load pipeline has three tiers: data aggregation & ETL on a 2 PB Hadoop cluster, producing about 1.2 TB of data per day; data archive & staging on Oracle 11g RAC, loading roughly 135 GB per day compressed, with extracted files (File 1 through File N) mapped one-to-one onto table partitions (Partition 1 through Partition N); and a BI server running SQL Server Analysis Services 2008 R2, which builds a 24 TB cube per quarter.
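The diagram's one-to-one mapping of files to partitions is what makes the load parallelizable. Below is a minimal Python sketch of that pattern; load_partition() and the file names are hypothetical stand-ins, since the real pipeline loads Oracle staging tables and processes SSAS partitions:

```python
# Minimal sketch of the 1:1 file-to-partition parallel load pattern in
# the diagram. load_partition() is a hypothetical stand-in for the real
# steps (bulk load into an Oracle staging partition, then process the
# matching SSAS cube partition).
from concurrent.futures import ThreadPoolExecutor

def load_partition(file_name, partition_id):
    # Real system: load file_name into staging Partition <partition_id>,
    # then process the corresponding cube partition.
    return f"{file_name} -> Partition {partition_id}"

files = [f"file_{i}.dat" for i in range(1, 9)]  # File 1 .. File N

# Because each file pairs with exactly one partition, the loads are
# independent and can run concurrently -- key to an hourly refresh.
with ThreadPoolExecutor(max_workers=8) as pool:
    for line in pool.map(load_partition, files, range(1, len(files) + 1)):
        print(line)
```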
7. Yahoo! TAO Platform Architecture
Queries at the "speed of thought"
[Architecture diagram] Ad hoc query and visualization in Tableau Desktop 6 runs against the 24 TB/quarter cube, averaging 6-second queries over 464 billion rows of event-level data per quarter. The BI query servers run SQL Server Analysis Services 2008 R2, exposing 24 dimensions, 247 attributes, and 207 measures. A custom J2EE optimization application queries the same servers with an average query time of 2 seconds.
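One common technique behind second-level queries on hundreds of billions of rows is partition elimination: a query scoped to a slice (for example, a date range) touches only the partitions covering that slice. The slides do not spell out the SSAS internals, so the sketch below only illustrates the selection logic, with hypothetical daily partitions:

```python
# Illustration of partition elimination with hypothetical daily
# partitions; the real engine here is SQL Server Analysis Services,
# which prunes measure-group partitions in a similar spirit.
from datetime import date

# One partition per day of January 2011 (hypothetical naming scheme).
partitions = {date(2011, 1, d): f"partition_2011_01_{d:02d}"
              for d in range(1, 32)}

def partitions_for(start, end):
    """Return only the partitions whose day falls within [start, end]."""
    return [name for day, name in sorted(partitions.items())
            if start <= day <= end]

# A one-week query scans 7 daily partitions instead of a whole quarter.
print(partitions_for(date(2011, 1, 3), date(2011, 1, 9)))
```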
8. Yahoo! TAO Return on Investment
For campaigns optimized using TAO, eCPMs (revenue) have increased! And for campaigns optimized using TAO, advertisers spent more with Yahoo! than before.
9. Yahoo! TAO Return on Investment
Yahoo! TAO exposed customer segment performance to campaign managers and advertisers for the first time! No longer "flying audience blind."