This document summarizes IBM's Big Data platform, comprising InfoSphere BigInsights and InfoSphere Streams. It discusses how the platform can integrate and manage data of high volume, variety, and velocity, apply advanced analytics to data in its native form, and enable visualization and development of new analytic applications. It also describes key components of the BigInsights platform, including Hadoop, data integration, governance, and various accelerators.
Calpont CTO Jim Tommaney provides an overview of InfiniDB 3, Calpont’s analytic data platform.
Discussion Topics
• How InfiniDB is architected for Big Data analytics
• How InfiniDB is provisioned for Amazon EC2 with an AMI
• How to quickly create a small or large cluster
• How InfiniDB’s parallel load capabilities deliver linear load scaling
Richard McDougall discusses trends in big data and frameworks for building big data applications. He outlines the growth of data, how big data is driving real-world benefits, and early adopter industries. McDougall also summarizes batch processing frameworks like Hadoop and Spark, graph processing frameworks like Pregel, and real-time processing frameworks like Storm. Finally, he discusses interactive processing frameworks such as Hive, Impala, and Shark and how to unify the big data platform using virtualization.
When a relational database is too rigid for the data at hand, a graph database may provide more flexibility. Franz uses a graph database called AllegroGraph for semantic analysis of text data. It extracts entities, concepts, and relationships and links them to external data sources. This allows for complex queries over distributed data. Franz applies this approach to analyze news articles and social media for defense customers. It extracts over 150 triples from each text and links them to profiles of politicians and other domain concepts. This semantic representation enables flexible querying and insight generation over distributed textual data.
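The triple-based representation described above can be sketched in a few lines of pure Python. This is only an illustration of the idea, not AllegroGraph's actual API, and the entities and relations are hypothetical examples:

```python
# Minimal sketch of a triple store: facts as (subject, predicate, object).
# Entities and relations below are hypothetical illustrations.
triples = [
    ("article:42", "mentions", "person:SenatorSmith"),
    ("person:SenatorSmith", "holdsOffice", "office:Senator"),
    ("article:42", "topic", "concept:DefenseBudget"),
    ("person:SenatorSmith", "memberOf", "party:Unity"),
]

def query(triples, s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Which politicians does article 42 mention?
mentioned = [o for _, _, o in query(triples, s="article:42", p="mentions")]
print(mentioned)  # ['person:SenatorSmith']
```

Because facts are uniform triples rather than rows in fixed tables, new relationship types can be added without schema changes, which is what makes the querying flexible.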
Masahiro Nakagawa presented on Treasure Data and its integration with Heroku. Treasure Data is a big data analytics company founded by Japanese entrepreneurs that provides cloud-based data warehousing and analytics services. It collects data from various sources using open source log collectors and stores large volumes of data in its columnar datastore. Treasure Data integrates with Heroku so users can easily add its big data analytics capabilities to applications hosted on Heroku.
This document summarizes a workshop on data management. It outlines the typical research lifecycle including proposal planning, project start up, data collection, analysis, sharing, and end of project. It discusses support for researchers within areas like data mining, curation, and preservation. It also discusses support from outside through infrastructure, policy, and best practices. Finally, it identifies 9 key skills gaps for librarians in advising researchers on data management tasks.
The document summarizes Terapot, a commercial email archiving system that uses Hadoop. It discusses how Terapot addresses the challenges of archiving massive amounts of email data at low cost and high scalability. Terapot leverages Hadoop's distributed architecture for crawling, indexing, and searching emails across thousands of servers. Key components include batch processing for archiving, real-time indexing, distributed search, and analysis tools that mine the archived email data.
A Study of I/O and Virtualization Performance with a Search Engine based on ... (Lucidworks, Archived)
Documentum xPlore provides an integrated search facility for the Documentum Content Server. The standalone search engine is based on EMC's xDB (a native XML database) and Lucene. In this talk we will introduce xPlore and some of its key components and capabilities. These include aspects of the tight integration of Lucene with the XML database: xQuery translation and optimization into Lucene queries/APIs, as well as transactional updates to Lucene. In addition, xPlore is being deployed aggressively into virtualized environments (covering both disk I/O and VM). We cover some performance results and tuning tips in these areas.
Introduction to Gruter and Gruter's BigData Platform (Gruter)
Gruter specializes in helping companies develop successful Big Data environments by designing carefully-modeled best-fit data platform solutions. Gruter's expertise extends across the full data life cycle, ensuring prescient architecture, robust build, timely deployment and simple operation and maintenance. Through a spirit of partnership and collaboration, Gruter provides its clients with the tools, know-how and support needed to put Big Data to work for immediate bottom-line outcomes.
Streaming Hadoop for Enterprise Adoption (DATAVERSITY)
VoltDB provides a streaming solution to simplify Hadoop for enterprise adoption by addressing common challenges. It allows for real-time decision making and analytics on high-quality data by reducing costs, data risks, and total pipeline times compared to traditional Hadoop implementations that are complex, expensive and slow. VoltDB is a high-performance in-memory database that can automatically scale out on commodity servers to enable faster, better and cheaper real-time insights from streaming big data.
This document discusses how Amazon Web Services (AWS) can be used for data-driven innovation. It provides an overview of AWS computing, storage, database and analytics services that can be used to collect, compute and collaborate on data. Specific services highlighted include S3, DynamoDB, EMR and EC2. Use cases discussed include log analysis, risk analysis, fraud prevention and market trend analysis. It also covers how AWS services allow for scalable, flexible and cost-effective infrastructure.
Cetas Analytics as a Service for Predictive Analytics (J. David Morris)
This document discusses how predictive analytics using big data can lead to successful recommendations and revenue maximization. It describes trends in data growth, the value of data analytics exceeding hardware costs, and how a unified analytics cloud platform can simplify infrastructure and optimize resources. Sample predictive analytics applications are outlined for industries like ecommerce, mobile, advertising, gaming, and IT, with the goal of revenue maximization and user engagement through recommendation engines and targeted placements. The cloudification of predictive analytics as an analytics-as-a-service approach is presented as the logical conclusion to fully leverage big data.
This document discusses how predictive analytics using big data leads to successful recommendations and revenue maximization. It outlines key trends like the growth of new data sources and analyzes how companies are using predictive analytics in applications like ecommerce, mobile, advertising, and gaming to optimize customer engagement and maximize profits. The document advocates taking predictive analytics to its logical conclusion through cloud-based analytics-as-a-service and leveraging big data to directly monetize insights from predictive modeling.
This document discusses Apache Hadoop, its current state and future direction. It provides an overview of Hadoop as an open source platform for storing and analyzing large amounts of data across distributed systems. The document outlines Hortonworks' vision of making Hadoop an enterprise-ready platform that can power data-driven businesses and unify both traditional and big data analytics methods. It also announces an upcoming Hadoop conference in June 2012 with sessions showcasing real-world Hadoop uses.
This document discusses real-time big data applications and provides a reference architecture for search, discovery, and analytics. It describes combining analytical and operational workloads using a unified data model and operational database. Examples are given of organizations using this approach for real-time search, analytics and continuous adaptation of large and diverse datasets.
In this slidecast, Richard Treadway and Rich Seger from NetApp discuss the company's storage solutions for Big Data and HPC. The company's HPC solutions for Lustre support massive performance and storage density without sacrificing efficiency.
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012 (Gigaom)
The document discusses the 3 V's of big data: volume, velocity, and variety. It provides examples of how each V impacts data analysis and storage. It also discusses how text data has been a major driver of big data growth and challenges. The key challenges are processing large and diverse datasets quickly enough to keep up with real-time data streams and demands.
The document discusses big data and the need for real-time processing and in-depth analysis capabilities. It introduces Jubatus as a distributed computing framework that can handle these requirements. Jubatus allows for real-time analysis of large datasets like tweets and recommendations based on customer purchase histories. It can perform in-depth classification of data into topics or companies and has high throughput of 100,000 updates per second per server.
Core concepts and Key technologies - Big Data Analytics (Kaniska Mandal)
Big data analytics has evolved beyond batch processing with Hadoop to extract intelligence from data streams in real time. New technologies preserve data locality, allow real-time processing and streaming, support complex analytics functions, provide rich data models and queries, optimize data flow and queries, and leverage CPU caches and distributed memory for speed. Frameworks like Spark and Shark improve on MapReduce with in-memory computation and dynamic resource allocation.
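The MapReduce batch model that these newer frameworks build on (and improve with in-memory computation) can be sketched in pure Python. This illustrates only the programming model, not Hadoop's or Spark's actual APIs:

```python
# Illustrative sketch of the MapReduce model behind batch frameworks.
# In-memory engines like Spark keep intermediate results (the grouped
# pairs below) cached across stages instead of spilling them to disk.
from collections import defaultdict

def map_phase(docs):
    """Emit a (word, 1) pair for every word in every document."""
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group emitted pairs by key, as the framework's shuffle step does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data", "big streams"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 1, 'streams': 1}
```

The disk writes between the shuffle and reduce stages are exactly the overhead that in-memory frameworks eliminate, which is why they win on iterative workloads.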
The document discusses trends in big data and data management. It notes that data volume, velocity, variety, and value are increasing dramatically. This rapid growth is challenging IT to manage and analyze more complex data relationships in real time and at large scale. The document also discusses how new consumption models like cloud computing and storage virtualization can help reduce costs and better manage the explosion of data replication. It introduces Hitachi's accelerated flash storage and new HUS VM entry-level enterprise storage system to address these big data challenges.
The New Alchemy: Turning Data into Gold
Developers are leading the charge to turn consumer behavior into profitable solutions. By accessing and analyzing the explosion of data from consumer activities, any developer can create the personalized, relevant products and services that customers demand and merchants urgently need. We will discuss how to acquire, store, and mine information, and how to design analytics-focused software and build data-driven software engines.
Challenges in Distributed Information Retrieval [RBY] (ICDE 2007, Turkey), Carlos Castillo (ChaTo)
This document discusses the challenges of distributed information retrieval. It covers four main modules: crawling, indexing, query processing, and caching. For each module, it discusses issues like partitioning tasks across servers, dealing with failures, communication between servers, and external factors. It provides more details on partitioning approaches for crawling and indexing, as well as the benefits and challenges of document partitioning during indexing.
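The document-partitioning approach mentioned above can be sketched as follows: each server indexes its own subset of the documents, and a query is broadcast to every partition with the results merged. This is a simplified illustration, with hypothetical documents and server counts, not the paper's implementation:

```python
# Sketch of document partitioning for distributed indexing and search.
# Documents and the number of servers are hypothetical examples.

def build_index(docs):
    """Build a local inverted index: term -> set of doc ids."""
    index = {}
    for doc_id, text in docs:
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index

documents = [(0, "hadoop cluster"), (1, "search engine"),
             (2, "hadoop search"), (3, "query cache")]
num_servers = 2

# Partition documents across servers; each builds its own local index.
partitions = [[d for d in documents if d[0] % num_servers == s]
              for s in range(num_servers)]
indexes = [build_index(p) for p in partitions]

def search(term):
    """Scatter the query to every partition and gather the matches."""
    hits = set()
    for index in indexes:
        hits |= index.get(term, set())
    return sorted(hits)

print(search("hadoop"))  # [0, 2]
print(search("search"))  # [1, 2]
```

The trade-off this sketch makes visible: every query touches every partition (unlike term partitioning), but each server's index stays small and documents can be added to any partition independently.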
1) A database is a collection of related data organized so that it can be easily accessed, managed, and updated. Database management systems (DBMS) allow users to create databases and access data in a controlled manner.
2) The database approach offers advantages over file processing systems like reduced data redundancy, improved data integrity, and shared access to data. Popular DBMSs include Microsoft Access, MySQL, Oracle, SQL Server, and IBM DB2.
3) Database administrators and analysts work to design efficient databases, define user access privileges, monitor performance and security, and ensure the reliability of data through backups and recovery procedures.
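The DBMS concepts above can be illustrated with Python's built-in `sqlite3` module (SQLite being another widely used DBMS). The table and rows are hypothetical examples:

```python
# Illustrating controlled, integrity-checked access through a DBMS,
# using Python's bundled sqlite3 module. Table and rows are examples.
import sqlite3

conn = sqlite3.connect(":memory:")  # an in-memory database
conn.execute("""CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    dept TEXT NOT NULL
)""")

# The DBMS enforces integrity rules: inserting a duplicate primary key
# or a NULL name raises an error instead of silently corrupting data.
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [(1, "Ada", "Engineering"), (2, "Grace", "Research")])
conn.commit()

# Declarative, shared access to the same data for any number of users.
rows = conn.execute(
    "SELECT name FROM employees WHERE dept = ?", ("Engineering",)
).fetchall()
print(rows)  # [('Ada',)]
```

Contrast this with a file-processing approach, where each program would parse its own copy of the data and re-implement these integrity checks, which is the redundancy the database approach removes.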
Big data and security involves managing huge amounts of data from various sources. Some key points:
- The amount of data generated annually is expected to grow exponentially to over 6.6 zettabytes by 2016. Individual companies like Facebook generate over 400 terabytes of data per day.
- Big data comes from a variety of structured and unstructured sources, and is distributed across multiple locations and systems. Both batch-based and real-time streaming approaches are used.
- Effectively organizing, analyzing, and deriving value from large, diverse datasets requires new approaches that can handle different data types and structures from many online and offline sources.
Big Data, Big Content, and Aligning Your Storage Strategy (Hitachi Vantara)
Fred Oh's presentation for SNW Spring, Monday 4/2/12, 1:00–1:45PM
Unstructured data growth is in an explosive state, and has no signs of slowing down. Costs continue to rise along with new regulations mandating longer data retention. Moreover, disparate silos, multivendor storage assets and less than optimal use of existing assets have all contributed to ‘accidental architectures.’ And while they can be key drivers for organizations to explore incremental, innovative solutions to their data challenges, they may provide only short-term gain. Join us for this session as we outline the business benefits of a truly unified, integrated platform to manage all block, file and object data, allowing enterprises to make the most of their storage resources. We explore the benefits of an integrated approach to multiprotocol file sharing, intelligent file tiering, federated search and active archiving; how to simplify and reduce the need for backup without the risk of losing availability; and the economic benefits of an integrated architecture approach that leads to lowering TCSO by 35% or more.
Protect Your Big Data with Intel® Xeon® Processors a.. (Odinot Stanislas)
This document discusses protecting big data with Intel technologies. It summarizes Intel's Distribution for Apache Hadoop software, which includes encryption and role-based access control features. The software provides an encryption framework that extends Hadoop's compression codec and establishes a common encryption API. It also allows different key storage systems to integrate for key management. Performance tests show Intel AES-NI instructions accelerate encryption and decryption, providing up to 19.8x faster decryption compared to non-AES-NI.
Big Data is growing rapidly in terms of volume, variety, and velocity. The cloud is well-suited to handle Big Data challenges by providing elastic and scalable infrastructure, which optimizes resources and reduces costs compared to traditional IT. In the cloud, users can collect, store, analyze and share large amounts of data without upfront investment, and scale easily as needs change. Real-world examples show how companies in industries like banking, retail, and advertising are using the cloud's Big Data services to gain insights from large datasets.
Big data and cloud computing are closely intertwined. The cloud is well-suited to handle big data challenges by providing massive scalability, flexible pay-as-you-go pricing, and removing the undifferentiated heavy lifting of managing infrastructure. This allows companies to focus on analyzing large and complex datasets. Examples show how companies use Amazon Web Services to collect petabytes of data from sources like sensors and social media, process it using services like EMR, and gain insights for applications in various industries.
Big Data refers to very large data sets that are too large for traditional data management tools to handle efficiently. It involves data that is highly varied in type, includes structured and unstructured data, and is created at high volume and velocity. Analyzing big data requires scaling out to many commodity servers rather than scaling up on expensive proprietary hardware. It also requires open source software frameworks and platforms rather than traditional proprietary solutions. Big data analytics can analyze raw, unstructured data from many sources to derive insights, while traditional analytics are limited to structured data from known sources and require data to be aggregated into a stable data model first.
Embedded Analytics: The Next Mega-Wave of Innovation (Inside Analysis)
This document provides an overview of an upcoming webinar hosted by Infobright. The webinar will feature a presentation by Susan Davis, VP of Marketing at Infobright, about how the company's technology enables real-time data analysis. Infobright offers a columnar database that provides fast analytics for large volumes of machine-generated data. Infobright's solutions help customers meet requirements for speed, flexibility, performance and low maintenance. Case studies will highlight how Infobright has helped telecom and mobile analytics companies like JDSU and Bango improve query response times, reduce data storage needs, and lower costs.
This document discusses big data solutions and analytics. It defines big data in terms of volume, velocity, and variety of data. It contrasts big data analytics with traditional business intelligence, noting that big data looks for untapped insights rather than dashboards. It also provides examples of scalable big data platform architectures and advanced analytics capabilities. Finally, it outlines Anexinet's big data offerings including strategy, starter solutions, projects, and partnerships.
Microsoft StreamInsight, part of the recent SQL Server 2008 R2 release, is a new platform for building rich applications that can process high volumes of event stream data with near-zero latency.
Mark Simms of Microsoft's SQLCAT will demonstrate the core skill sets and technologies needed to deliver StreamInsight enabled solutions, and discuss some of the core scenarios.
Mark will provide a detailed walkthrough of the three major components of StreamInsight: input and output adapters, the StreamInsight engine runtime, and the semantics of the continuous standing queries hosted in the StreamInsight engine.
This presentation includes hands-on demos, including building out a real-time data processing solution interacting with SQL Server and Sharepoint.
You will learn:
• The new capabilities StreamInsight brings to data processing and analytics, unlocking the ability to extract real time business intelligence from streaming data.
• How StreamInsight interacts with and complements other components of SQL Server and the rest of the Microsoft technology stack.
• How to ramp up on the skills and technology necessary to build out end to end solutions leveraging streaming data sources.
Using real-time big data analytics for competitive advantage – Amazon Web Services
Many organisations find it challenging to perform real-time data analytics using their own on-premise IT infrastructure. Building a system that can adapt and scale rapidly to handle dramatic increases in transaction loads can be a costly and time-consuming exercise.
Most of the time, infrastructure is under-utilised, and it is nearly impossible for organisations to forecast the amount of computing power they will need in the future to serve their customers and suppliers.
To overcome these challenges, organisations can instead utilise the cloud to support their real-time data analytics activities. Scalable, agile and secure, cloud-based infrastructure enables organisations to quickly spin up infrastructure to support their data analytics projects exactly when it is needed. Importantly, they can ‘switch off’ infrastructure when it is not.
BluePi Consulting and Amazon Web Services (AWS) are giving you the opportunity to discover how organisations are using real time data analytics to gain new insights from their information to improve the customer experience and drive competitive advantage.
Big Data and Implications on Platform Architecture – Odinot Stanislas
This document discusses big data and its implications for data center architecture. It provides examples of big data use cases in telecommunications, including analyzing calling patterns and subscriber usage. It also discusses big data analytics for applications like genome sequencing, traffic modeling, and spam filtering on social media feeds. The document outlines necessary characteristics for data platforms to support big data workloads, such as scalable compute, storage, networking and high memory capacity.
Intel Cloud Summit: Big Data by Nick Knupffer – IntelAPAC
1. Big data is growing rapidly in terms of volume, velocity, and variety.
2. Intel is well positioned to help organizations address big data challenges through its software stack, platforms, and by investing in new technologies.
3. Intel is committed to fostering the growth of the big data ecosystem through broad collaboration with partners.
This document discusses a webinar on data lakes and analytics hosted by Karlos Correia and Claudio Chiba, AWS solutions architects for the public sector. The agenda covers what a data lake is, why organizations use data lakes, how data lakes expand traditional analytics approaches, and the benefits of data lakes such as centralized data storage and schema-on-read capabilities. Amazon S3 and AWS analytics services are positioned as enabling technologies for building data lakes.
Big data? No. Big Decisions are What You Want – Stuart Miniman
This document summarizes a presentation about big data. It discusses what big data is, how it is transforming business intelligence, who is using big data, and how practitioners should proceed. It provides examples of how companies in different industries like media, retail, and healthcare are using big data to drive new revenue opportunities, improve customer experience, and predict equipment failures. The presentation recommends developing a big data strategy that involves evaluating opportunities, engaging stakeholders, planning projects, and continually executing and repeating the process.
Big Data on AWS
The document discusses how the cloud is well suited to support big data applications and analytics. It notes that the cloud provides elastic, on-demand infrastructure that optimizes resources and reduces costs compared to traditional IT. This allows organizations to focus on analyzing and using big data rather than managing infrastructure. The cloud also enables the collection and storage of massive datasets. Examples are given of companies using cloud-based big data for applications like risk analysis, recommendations, and targeted advertising.
This document discusses how the cloud is well suited to address the challenges of big data. It notes that big data sets are getting larger and more complex, requiring new tools and approaches. The cloud optimizes precious IT resources by enabling elastic scaling, global accessibility, easy experimentation, and reducing costs. The cloud empowers users to balance costs and time. Several real-world examples are provided, such as banks using the cloud to perform Monte Carlo simulations and retailers using it for targeted recommendations and click stream analysis.
The document summarizes a presentation on evolving a new analytical platform. It discusses defining the platform to include tools for the whole research cycle beyond just business intelligence (BI), with SQL Server 2008 R2 as an example of defining the platform. It also discusses what is working with existing platforms and what is still missing, including the need for more scalable data storage and processing.
Big data refers to the massive amounts of information created every day from various sources. Some key facts about big data include:
- Every two days now we create as much data as we did from the beginning of civilization until 2003.
- Technologies to handle big data must be able to process petabytes and exabytes of data from a variety of structured and unstructured sources in real-time.
- Analyzing big data can provide valuable insights into areas like smart cities, healthcare, retail and manufacturing by improving operations and decision making.
However, big data also presents challenges around its massive scale, rapid growth, heterogeneity and real-time processing requirements that differ from traditional data warehousing.
1) The document discusses big data strategies and technologies including Oracle's big data solutions. It describes Oracle's big data appliance which is an integrated hardware and software platform for running Apache Hadoop.
2) Key technologies that enable deeper analytics on big data are discussed including advanced analytics, data mining, text mining and Oracle R. Use cases are provided in industries like insurance, travel and gaming.
3) An example use case of a "smart mall" is described where customer profiles and purchase data are analyzed in real-time to deliver personalized offers. The technology pattern for implementing such a use case with Oracle's real-time decisions and big data platform is outlined.
Webinar | Using Hadoop Analytics to Gain a Big Data Advantage – Cloudera, Inc.
Learn about:
Why big data matters to your business: realize revenue, increase customer loyalty, and pinpoint effective strategies
The business and technical challenges of big data solutions
How to leverage big data for competitive advantage
The “must haves” of an effective big data solution
Real-world examples of Cloudera, Pentaho and Dell big data solutions in action
Similar to 2012.04.26 big insights streams im forum2
Cloud Data Services – from prototyping to scalable analytics on cloud – Wilfried Hoge
Presentation from the German customer conference of IBM's Technical Expert Council. It shows how IBM's cloud data services could be used to explore data for new insights or business models.
Is it harder to find a taxi when it is raining? – Wilfried Hoge
Using open data to answer the question of whether it is harder to find a taxi when it is raining. Live demo of analyzing taxi data with DashDB, R, and Bluemix.
Presented on data2day conference.
innovations born in the cloud – cloud data services from IBM to prototype you... – Wilfried Hoge
To bring your ideas for gaining insights from new data sources to life, you must be able to prototype, fail fast if the ideas don't work, and move to production easily if they succeed. See how IBM's cloud data services can help you start testing your ideas with data.
- The document discusses IBM's Watson cognitive computing platform, which understands natural language, learns from interactions, and generates hypotheses.
- Watson Analytics allows users to analyze data using natural language and includes features like predictive analytics, data visualization, and self-service analytics.
- The document outlines IBM's Watson services like personality insights and describes the process for building cognitive apps using the Watson Developer Cloud.
Analyze Twitter data completely in Bluemix. Collect data, add sentiment, copy to in-memory database, analyze with R or WatsonAnalytics. All in the cloud.
InfoSphere BigInsights – Analytics power for Hadoop – field experience – Wilfried Hoge
This document provides an overview and summary of InfoSphere BigInsights, an analytics platform for Hadoop. It discusses key features such as real-time analytics, storage integration, search, data exploration, predictive modeling, and application tooling. Case studies are presented on analyzing binary data and developing applications for transformation and analysis. Partnerships and certifications with other vendors are also mentioned. The document aims to demonstrate how BigInsights brings enterprise-grade features to Apache Hadoop and provides analytics capabilities for business users.
Presentation about BigData from a German Webcast: http://business-services.heise.de/it-management/big-data/beitrag/big-data-technologie-einsatzgebiete-datenschutz-160.html?source=IBM_12_2013_IT_Conn
InfoSphere BigInsights is IBM's distribution of Hadoop that:
- Enhances ease of use and usability for both technical and non-technical users.
- Includes additional tools, technologies, and accelerators to simplify developing and running analytics on Hadoop.
- Aims to help users gain business insights from their data more quickly through an integrated platform.
2. IBM's Big Data Platform
InfoSphere BigInsights and InfoSphere Streams
Wilfried Hoge – Leading Technical Sales Professional
hoge@de.ibm.com
twitter.com/wilfriedhoge
3. IBM Big Data Strategy: Move the Analytics Closer to the Data
New analytic applications drive the requirements for a big data platform:
• Integrate and manage the full variety, velocity and volume of data
• Apply advanced analytics to information in its native form
• Visualize all available data for ad-hoc analysis
• Development environment for building new analytic applications
• Workload optimization and scheduling
• Security and Governance
[Platform diagram: Analytic Applications (BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics) on top of the IBM Big Data Platform (Visualization & Discovery, Application Development, Systems Management, Accelerators, Hadoop System, Stream Computing, Data Warehouse, Information Integration & Governance)]
5. BigInsights – analytical platform for persistent “Big Data”
Based on open source & IBM technologies
Distinguishing characteristics
• Built-in analytics . . . enhances business knowledge
• Enterprise software integration . . . complements and extends existing capabilities
• Production-ready platform with tooling for analysts, developers, and administrators . . . speeds time-to-value and simplifies development/maintenance
IBM advantage
• Combination of software, hardware, services and advanced research
[IBM Big Data Platform diagram: Analytic Applications over Visualization & Discovery, Application Development, Systems Management, Accelerators, Hadoop System, Stream Computing, Data Warehouse, and Information Integration & Governance]
6. About the BigInsights Platform
Flexible, enterprise-class support for processing large volumes of data
• Based on Google’s MapReduce technology
• Inspired by Apache Hadoop; compatible with its ecosystem and distribution
• Well-suited to batch-oriented, read-intensive applications
• Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
• CPU + disks = “node”
• Nodes can be combined into clusters
• New nodes can be added as needed without changing
• Data formats
• How data is loaded
• How jobs are written
7. Hadoop Explained – Map Reduce
Hadoop computation model
• Data stored in a distributed file system spanning many inexpensive computers
• Bring function to the data
• Distribute application to the compute resources where the data is stored
Scalable to thousands of nodes and petabytes of data
Example – the standard WordCount application (the reducer body is truncated on the slide):

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text val, Context context) {
            StringTokenizer itr = new StringTokenizer(val.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> val, Context context) {
            int sum = 0;
            for (IntWritable v : val) {
                sum += v.get();
            . . .

The MapReduce application runs in three steps across the Hadoop data nodes:
1. Map Phase – break the job into small parts and distribute the map tasks to the cluster
2. Shuffle – transfer interim output for final processing
3. Reduce Phase – boil all output down to a single result set, which is returned
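The WordCount flow above can also be sketched locally in plain Java, with no Hadoop dependency, to make the three phases concrete. Class and method names here are illustrative only, not part of the Hadoop API:

```java
import java.util.*;

// Minimal local sketch of the MapReduce word-count flow.
// No Hadoop involved; names are illustrative only.
public class LocalWordCount {

    // 1. Map phase: break each input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // 2. Shuffle: group the interim (word, 1) pairs by key.
    // 3. Reduce: sum the values per key into a single result set.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> p : map(line)) {
                counts.merge(p.getKey(), p.getValue(), Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> result = run(List.of("big data big insights", "big streams"));
        System.out.println(result); // {big=3, data=1, insights=1, streams=1}
    }
}
```

On a cluster, the map calls run in parallel on the nodes holding the data blocks, and the shuffle moves the pairs over the network; the logic per record stays the same.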
8. BigInsights – Value Beyond Open Source
Technical differentiators
• Built-in analytics
• Text processing engine, annotators, Eclipse tooling
• Statistical and predictive analysis
• Interface to project R (statistical platform)
• Enterprise software integration (DBMS, warehouse)
• Spreadsheet-style analytical tool for analysts
• Ready-made business process accelerators
• Integrated installation of supported open source and IBM components
• Web Console for administration and application access
• Platform enrichment: additional security, performance features, . . .
• Standard IBM licensing agreement and world-class support
Business benefits
• Quicker time-to-value due to IBM technology and support
• Reduced operational risk
• Enhanced business knowledge with flexible analytical platform
• Leverages and complements existing software assets
9. Web Installation Tool
Seamless process for single node and cluster environments
Integrated installation of all selected components
Post-install validation of IBM and open source components
No need to iteratively download, configure, and test multiple open source projects and their pre-requisite software.
10. Web Console
Manage BigInsights
• Inspect system health
• Add / drop nodes
• Start / stop services
• Run / monitor jobs (applications)
• Explore / modify file system
Launch applications
• Spreadsheet-like analysis tool
• Pre-built applications (IBM supplied
or user developed)
Publish applications
Leverage community resources
11. BigSheets
BigSheets is a visual tool for data manipulation and prototyping
• Allows more users to do more work, more quickly
• Simply stated, growing an army of MapReduce developers is not cost effective
• In your BI environments you have a ratio of 30+ report users for every complex SQL developer. We need to support the same ratios with BigInsights
Sample Uses
• Data exploration and visualization
• Visual job creation
14. Quick start applications or “apps”
Reusable software assets based on customer engagements
• Useful for starting point for various applications
• Can be customized by BigInsights application developers as needed
• Accessible through Web console
Available assets
• Data export (to relational DBMS, files, HBase)
• Data import (from relational DBMS, files)
• Web crawler, Twitter crawler
• Boardreader.com support (Web forum search engine)
• Ad hoc queries for Jaql, Hive, Pig
• TeraGen-TeraSort, WordCount sample applications
17. Build a Big Data Program – Map Reduce example
Eclipse based development tools
For JAQL, Hive, Java MapReduce, Text Analytics
18. Text Analytics in BigInsights
Text analytics – distill structured information from unstructured data
• Rich annotator library supports multiple languages
• Declarative Information Extraction (IE) system based on an algebraic framework
• Richer, cleaner rule semantics
• Better performance through optimization
• Compose operators to build complex annotators
Developed at IBM Research since 2004
Embedded in several IBM products
• Lotus Notes
• Cognos Consumer Insights
• InfoSphere Streams
19. Turns disparate words into measurable insights
Pre-configured text annotators ready for distributed processing on Big Data
• City, County, Zipcode, Address, MapLocation, StateOrProvince, Country, Continent, EmailAddress, Person, Organization, DateTime, URL, Company Names, Merger, Acquisition, Alliance, etc.
Support for native languages including double-byte
The annotation pipeline runs in stages:
1. Physically assemble data, standardize formats and addresses, auto-identify language, process punctuation and non-grammatical characters, standardize spelling.
2. Part-of-speech identification, standard and customized extraction dictionaries, proper noun identification, concept categorization, synonyms, exclusions, multi-terms, regular expressions, fuzzy-matching.
3. Identify positive or negative sentiment, NLP-based analytics, define variables, macros and rules.
4. Iterative classification using automated and manual techniques; concept derivation & inclusion, semantic networks and co-occurrence rules.
5. Reporting/monitoring of social commentary, combination with structured data, clustering, associated concepts, correlated concepts, auto-classification of documents, sites, posts.
20. Text Analytics – highly accurate analysis of textual content
How it works
• Parses text and detects meaning with annotators
• Understands the context in which the text is analyzed
• Hundreds of pre-built annotators for names, addresses, phone numbers, among others
Accuracy
• Highly accurate in deriving meaning from complex text
Performance
• AQL language optimized for MapReduce
Example – unstructured text (document, email, etc.): “Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas, made the save. Winger Andres Iniesta scored for Spain for the win.” The annotators turn this raw text into classification and insight.
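BigInsights expresses annotators declaratively in AQL, but the basic idea of rule-based extraction can be illustrated with a toy Java annotator. The regex rule below (two adjacent capitalized words as a "Person" span) is purely illustrative and not IBM's implementation:

```java
import java.util.*;
import java.util.regex.*;

// Toy rule-based annotator: extracts "Person"-like spans (two adjacent
// capitalized words) from raw text. Illustrative only; real BigInsights
// annotators are declarative AQL rules optimized for MapReduce.
public class ToyPersonAnnotator {
    private static final Pattern PERSON =
            Pattern.compile("\\b([A-Z][a-z]+ [A-Z][a-z]+)\\b");

    static List<String> annotate(String text) {
        List<String> spans = new ArrayList<>();
        Matcher m = PERSON.matcher(text);
        while (m.find()) {
            spans.add(m.group(1));
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "the keeper for Spain, Iker Casillas made the save";
        System.out.println(annotate(text)); // [Iker Casillas]
    }
}
```

A single naive rule like this produces false positives on harder sentences, which is exactly why the production system composes many operators (dictionaries, part-of-speech cues, context rules) instead of one regex.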
23. Statistical and Predictive Analysis
Framework for machine learning (ML) implementations on Big Data
• Large, sparse data sets, e.g. 5B non-zero values
• Runs on large BigInsights clusters with 1000s of nodes
Productivity
• Build and enhance predictive models directly on Big Data
• High-level language – Declarative Machine Learning Language (DML)
• E.g. 1500 lines of Java code boils down to 15 lines of DML code
• Parallel SPSS data mining algorithms implementable in DML
Optimization
• Compile algorithms into optimized parallel code
• For different clusters and different data characteristics
• E.g. 1 hr. execution (hand-coded) down to 10 mins
[Chart: execution time (sec, 0 to 4500) vs. number of non-zeros (million, 0 to 2000) for Java Map-Reduce, SystemML, and single-node R]
24. Workload Optimization
Optimized performance for big data analytic workloads
Adaptive MapReduce
§ Algorithm to optimize execution time of multiple small jobs
§ Performance gains of 30%: reduced overhead of task startup
Hadoop System Scheduler
§ Identifies small and large jobs from prior experience
§ Sequences work to reduce overhead
[Diagram: Task Map (break task into small parts) → Adaptive (optimization: order small units of work) → Map Reduce (many results to a single result set)]
25. InfoSphere BigInsights – Embrace and Extend Hadoop
[Architecture stack, mixing IBM and open source components:]
• Analytics: ML Analytics, Text Analytics, BigSheets interface
• Application: Pig, Hive, Jaql, Avro, Zookeeper, IBM LZO Compression
• MapReduce: AdaptiveMR, FLEX, BigIndex, Oozie, Lucene
• Storage: HBase, HDFS, GPFS-SNC
• Data sources / connectors: Netezza, BoardReader, R, Streams, Data Stage, DB2, CSV / XML / JSON, SPSS, Flume, JDBC, Web Crawler
Web console
• Monitor cluster health
• Add / remove nodes
• Start / stop services
• Inspect job status
• Inspect workflow status
• Deploy apps
• Launch apps / jobs
• Work with distrib. file system
• Work with spreadsheet interface
• Support REST-based API
• . . .
Eclipse plug-ins
• Text analytics
• MapReduce programming
• Jaql development
• Hive query development
26. Ways to get started with BigInsights
In the Cloud
• Via RightScale, or directly on Amazon, Rackspace, IBM Smart Enterprise Cloud, or on private clouds.
• Pay only for the resources used.
In the Virtual Classroom
• Free Hadoop Fundamentals training course
www.bigdatauniversity.com
• e.g. BD105EN - Text Analytics Essentials
On Your Cluster
• Download Basic Edition from ibm.com.
In the Classroom
• Enroll in the InfoSphere BigInsights Essentials course.
27. Visit the BigInsights technical portal . . . .
Free links to papers, demos, discussion forum, and more
http://www.ibm.com/developerworks/wiki/biginsights/
28. Streams – analytical platform for in-motion “Big Data”
Built to analyze data in motion
• Multiple concurrent input streams
• Massive scalability
Process and analyze a variety of data
• Structured, unstructured content, video, audio
• Advanced analytic operators
[IBM Big Data Platform diagram: Analytic Applications over Visualization & Discovery, Application Development, Systems Management, Accelerators, Hadoop System, Stream Computing, Data Warehouse, and Information Integration & Governance]
29. Stream Computing – Analyze Data in Motion
Traditional Computing                          | Stream Computing
Historical fact finding                        | Current fact finding
Find and analyze information stored on disk    | Analyze data in motion – before it is stored
Batch paradigm, pull model                     | Low-latency paradigm, push model
Query-driven: submit queries to static data    | Data-driven: bring the data to the query
Query → Data → Results                         | Data → Query → Results
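The push-model difference can be sketched in plain Java: instead of querying data after it is stored, a standing query is evaluated against each event as it arrives. This is a minimal illustration, not the InfoSphere Streams runtime:

```java
import java.util.*;

// Minimal illustration of the push model: a standing query (moving
// average over the last N readings) is re-evaluated as each event
// arrives, before anything is written to storage.
public class StandingQuery {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int size;

    StandingQuery(int size) { this.size = size; }

    // Data is pushed to the query; the result updates per event.
    double onEvent(double value) {
        window.addLast(value);
        if (window.size() > size) {
            window.removeFirst();
        }
        return window.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    public static void main(String[] args) {
        StandingQuery avg = new StandingQuery(3);
        for (double reading : new double[] {1.0, 2.0, 3.0, 10.0}) {
            System.out.println(avg.onEvent(reading));
        }
        // Final window is {2.0, 3.0, 10.0}, so the last result is 5.0
    }
}
```

In a real streams engine, many such operators are fused into processes and distributed across a cluster, but each one is still driven by arriving data rather than by a submitted query.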
30. Why InfoSphere Streams?
Applications that require on-the-fly processing, filtering and analysis of
streaming data
• Sensors: environmental, industrial, surveillance video, GPS, …
• “Data exhaust”: network/system/web server/app server log files
• High-rate transaction data: financial transactions, call detail records
Criteria: two or more of the following
• Messages are processed in isolation or in limited data windows
• Sources include non-traditional data (spatial, imagery, text, …)
• Sources vary in connection methods, data rates, and processing requirements,
presenting integration challenges
• Data rates/volumes require the resources of multiple processing nodes
• Analysis and response are needed with sub-millisecond latency
• Data rates and volumes are too great for store-and-mine approaches
31. Massively Scalable Stream Analytics
Linear Scalability
§ Clustered deployments – unlimited scalability
Automated Deployment
§ Automatically optimize operator deployment across clusters
Performance Optimization
§ JVM sharing – minimize memory use
§ Fuse operators on the same cluster
§ Telco client – 25 million messages per second
Analytics on Streaming Data
§ Analytic accelerators for a variety of data types
§ Optimized for real-time performance
[Diagram: the Streams Studio IDE drives automated and optimized deployment onto the Streams Runtime; streaming data sources flow through source adapters, analytic operators, and sync adapters, with visualization of results]
33. InfoSphere Streams for superior real time analytic processing
Streams Processing Language (SPL), built for streaming applications:
• Reusable operators
• Rapid application development
• Continuous “pipeline” processing
Compile groups of operators into single processes:
• Efficient use of cores
• Distributed execution
• Very fast data exchange
• Can be automatic or tuned
• Scaled with push of a button
Use the data that gives you a competitive advantage:
• Can handle virtually any data type
• Use data that is too expensive and time sensitive for traditional approaches
Easy to extend:
• Built-in adaptors
• Users add capability with familiar C++ and Java
Easy to manage:
• Automatic placement
• Multi-user / multiple applications
Flexible and high-performance transport:
• Very low latency
• High data rates
Dynamic analysis:
• Programmatically change topology at runtime
• Create new subscriptions
• Create new port properties
• Extend applications incrementally without downtime
35. Compiler Framework
Operator Fusion
• Fine-grained operators
• From small parts, make larger ones that fit
Code generation
• Generates code to match the underlying runtime environment
• Number of cores
• Interconnect characteristics
• Architecture-specific instructions, driven by automatic profiling
• Compiler-based optimization, driven by incremental learning of application characteristics
[Diagram: the logical application view is compiled into a physical application view]
36. Streams Data Mining Toolkit
Enables scoring of real-time data in a Streams application
• Scoring is performed against a predefined model
• Supports a variety of model types and scoring algorithms
Models represented in Predictive Model Markup Language (PMML)
• Standard for statistical and data mining models
• XML Representation
Toolkit provides four Streams operators to enable scoring
• Classification
• Clustering
• Regression
• Associations
The toolkit supports dynamic replacement of the PMML model used by an
operator.
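PMML itself is plain XML; a minimal, hand-written regression model might look roughly like the fragment below. This is illustrative only — the field names, model, and version are assumptions, not content from the Streams toolkit:

```xml
<!-- Illustrative, minimal PMML regression model: score = 1.0 + 2.0 * x -->
<PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
  <Header description="Toy model for illustration only"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression" modelName="toyRegression">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="1.0">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```

Because the model is an external document rather than compiled code, a scoring operator can swap in a retrained model at runtime, which is what the dynamic-replacement feature above relies on.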
37. Without a Big Data Platform vs. the IBM Big Data Platform
Without a platform, you code it all yourself: event handling, custom SQL and scripts, multithreading, check pointing, application management, performance optimization, debug, connectors, HA, and security.
With the IBM Big Data Platform, Streams provides development, deployment, runtime, and infrastructure services, plus over 100 sample applications and industry-focused toolkits with 300+ functions and operators, accelerators, toolkits, and connectors.
“TerraEchos developers can deliver applications 45% faster due to the agility of Streams Processing Language…” – Alex Philip, CEO and President, TerraEchos
40. Example of 360° customer view
[Diagram: the Big Data Platform connects data sources to business processes]
• Business processes: events and alerts, master data management, campaign management, Cognos Consumer Insight
• Website logs and social media feed Internet-scale analytics, yielding web traffic and social media insight
• Call detail records feed streaming analytics, yielding call behavior and experience insight
• Information integration and the data warehouse tie the platform to the business processes
41. IBM's Big Data Platform
InfoSphere BigInsights and InfoSphere Streams