High-level view of the confusing world of 'Big Data'. The mission of the non-profit American Institute of Big Data Professionals (AIBDP) is to provide structure and standards for Big Data terminology, proficiency, methodology, and expectations.
AIBDP Agenda Big Data Connections
1. Agenda
• Quick Poll
• Overview – AIBDP / Big Data Connection
• Prasad Mavuduri – Board Member, AIBDP –
“Demystifying Big Data”
• David Sonnenschein – Vice President & Aleks
Swerdlow – Community Manager – SAP Labs –
“HANA In-Memory – Start-ups Success Stories”
• Networking & Q&A
2. Welcome
• Thank you: Francis - Silicon Valley Strategy,
Innovation and Product Management group
• Thank you: Michael & Sam and the Microsoft
Store
• Thank you: Aleks & David & SAP HANA
• Thank you: All of You… You are the ‘Secret
Sauce’
3. Quick Poll
• Relationship & Experience w/ Big Data
• Job Role
• Industry
• Company Years - Start-up?
• Big Data Implementation Status
• Biggest Challenges / Opportunities
– Ask the right question…
• Vs Competitors?
4. Overview - Big Data Connections
Mission: Demystify Big Data
– Five E’s – entertain, engage, educate etc
– Focus on Solutions (vs technology)
– Focus on Specific Verticals
• (e.g., Healthcare, Risk, eCom/eMarketing,
Manufacturing, Logistics, Telecom…)
– Best Practices Case Study Reviews
– Networking & Shared Learning
– Sponsored by the American Institute of
Big Data Professionals (AIBDP.org)
– Sponsored by Big Data consulting firm,
Data-Magnum
5. The Confusing World of Big Data
[Diagram: Big Data landscape map spanning STRUCTURED and UNSTRUCTURED data, with layer labels APPLICATIONS, TOOLS, DATA MANAGEMENT.]
• Categories: BI Platform / Reporting; Visualizations; Unstructured / Search; Indexing / Metadata; Search; NLP; Hadoop Analytics; Hadoop Dev Platforms / Automation; Predictive Analytics; Text / Sentiment Analysis; Vertical Market Applications; Messaging; Optimization; Data Integration / CEP; AI
• Data management: Transactional DB; High Performance Analytical DB; NewSQL; Enhancement; Distributed; NoSQL (Graph, Document, Key Value / Column); Data Warehouses; HDFS; HDFS Alternatives; Filesystem; IMDG; OSS
• Services: Hadoop aaS; DBaaS; Data as a Service
• Products named: HANA; GraphDB; EMR; vFabric L; Drill; Impala; Redshift
• Data sources: Enterprise Apps; Internet Apps; Social Media; Web Content; Mobile Devices; Camera / DVR; Sensors / RFID; Logfiles
Based on Source: Perella Weinberg Partners
17. It can be made more complicated…
o Hadoop
o NoSQL
o NewSQL
o Structured Databases
o NGDW (next generation data warehouse)
o Cloud Services
o Technical Services
o Professional Services
o Distributors
o Deployment services
o Deployment stack/appliances
o Development services
o Application stacks
o Database stacks
o Managed Monitoring
o Storage
o Security
Source: Sqrrl: To simplify the NoSQL world, let's take a look at the top 3 databases in terms of current popularity and how they compare to Apache Accumulo, which is at the core of our product, Sqrrl Enterprise.

MongoDB: It is a wonderfully easy-to-use document store that many select as a flexible replacement for a SQL database, as it (like all NoSQL databases) does not require pre-defined schemas. However, MongoDB has difficulty scaling to very large datasets (e.g., 100+ TB) and does not natively work with your Hadoop cluster. It also does not possess fine-grained security controls.

Cassandra: This is an excellent choice if your data is too big for MongoDB and you require multi-datacenter replication. Although Cassandra was not originally designed to run natively on your Hadoop cluster, it now has integrations with MapReduce, Pig, and Hive. It does not possess fine-grained security controls.

HBase: HBase natively integrates with Hadoop, and it can handle very large datasets. However, it does not have fine-grained security controls.

Accumulo: Accumulo has an architecture most similar to HBase, which allows it also to natively plug into your Hadoop cluster. It is far more scalable than MongoDB, and with reported cluster sizes in the multiple thousands within the Intelligence Community it is also significantly more scalable than HBase and Cassandra. Accumulo is the only NoSQL database with cell-level security capabilities. Accumulo also has other features that could lead one to choose it over HBase or Cassandra for reasons other than security or scalability. For example, Accumulo has a powerful server-side programming mechanism called Iterators, which provides the capability to do a variety of real-time aggregations and analytics.

These high-level differences between MongoDB, Cassandra, HBase, and Accumulo are summarized in the decision tree diagram below. Of course, there are a wide variety of more detailed technical differences that will be explored in greater detail in a later post. This decision tree can be summarized with a few simple statements:
• If you need a quick, simple solution and have “small” Big Data (e.g., a few dozen terabytes), MongoDB may be the answer.
• If you need cell-level security or multi-petabyte scalability, Accumulo is the right answer.
• If you have data that is too big for MongoDB and don’t need cell-level security or massive scalability, we would recommend testing HBase, Cassandra, and Accumulo for your specific workloads. Each has its own nuanced advantages and disadvantages.
• If you don’t need real-time analytics, you are probably on the wrong decision tree and can stick with the Hadoop Distributed File System and batch analytics.

It is worth noting that the NoSQL databases above are all open source databases. Sqrrl Enterprise builds upon Accumulo and adds a number of additional features to Accumulo including streaming ingest, JSON, encryption, identity management integrations, full-text search, SQL queries, graph search, and statistics. We believe that these features set Sqrrl Enterprise apart from other Big Data platforms.
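The decision-tree summary in the Sqrrl notes above can be sketched as a small Python function. This is only an illustration of the four summary statements, not Sqrrl's actual diagram: the function name `recommend_store`, the order of the checks, and the two size thresholds (50 TB for "a few dozen terabytes", 2,000 TB for "multi-petabyte") are assumptions made for the example.

```python
# Hypothetical thresholds; the notes only say "a few dozen terabytes"
# and "multi-petabyte", so these exact numbers are assumptions.
SMALL_BIG_DATA_TB = 50
MULTI_PETABYTE_TB = 2000


def recommend_store(data_tb, needs_cell_level_security=False, needs_realtime=True):
    """Return a suggestion following the four summary statements in the notes."""
    # No real-time analytics: wrong decision tree, stay with HDFS and batch jobs.
    if not needs_realtime:
        return "HDFS with batch analytics"
    # Cell-level security or multi-petabyte scale points to Accumulo.
    if needs_cell_level_security or data_tb >= MULTI_PETABYTE_TB:
        return "Accumulo"
    # "Small" Big Data: MongoDB may be the answer.
    if data_tb <= SMALL_BIG_DATA_TB:
        return "MongoDB"
    # Too big for MongoDB, no special security or scale needs: benchmark.
    return "Benchmark HBase, Cassandra, and Accumulo on your workload"


if __name__ == "__main__":
    print(recommend_store(20))
    print(recommend_store(20, needs_cell_level_security=True))
    print(recommend_store(300))
    print(recommend_store(300, needs_realtime=False))
```

Treat the output as a starting point for evaluation only; the notes themselves recommend testing the candidates against specific workloads.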
http://www.capgemini.com/blog/capping-it-off/2012/09/big-data-vendors-and-technologies-the-list
Big Data Vendors and Technologies

Data Acquisition stream - technological providers:
Ab Initio, HP, IBM (Datastage, Streams, Data mirror), Informatica (PowerCenter, PowerExchange, CEP), Kalido, Microsoft, Numenta, Oracle, SAP, SAS, Splunk, Syncsort, Talend, Tibco

Data Providers:
ComScore, Datasift, Experian, Factual, GfK, Gnip, IMS, Inrix, Kaggle, Knoema, LexisNexis, Microsoft (with their Windows Azure Marketplace data market), Nielsen, Reuters, Salesforce Radian6, Symphony IRI, social network websites like Facebook, Google+, LinkedIn, Tumblr, Twitter or Viadeo, and all the Open Data providers, like governments, regions, etc.

Marshalling domain - Very Large Data Warehousing and BI Appliances:
Actian; Paraccel, EMC² (Greenplum), HP (Vertica), IBM (Netezza), Kognitio, Microsoft (SQL 2012 and PDW), Oracle (Exadata), SAP (HANA and Sybase IQ), SAS, Teradata

NoSQL Domain – Main technologies and vendors:
Amazon (as cloud provider or with their own NoSQL solution), Cassandra, Cloudera (CDH, Hadoop distribution), CouchDB, EMC², Google, Hadoop (of course), Hortonworks (Hadoop distribution), HP, IBM, KX, MapR (Hadoop distribution), MarkLogic, Microsoft (Hadoop on Windows and Azure), MongoDB, Neo4J, Oracle, Palantir, Snaplogic, Sparsity, Splunk, Teradata (Aster Data), ZL Technologies

Content Management Space:
Adobe, Alfresco, EMC² (Documentum), IBM (FileNet), HP (Autonomy), Microsoft, OpenText, Oracle

Analytics phase:
Predictive technologies (such as data mining) and vendors, which are Adobe, EMC², GoodData, Hadoop Map Reduce, HP, IBM (SPSS), Karmasphere, Kxen, Microsoft, Mzinga, Oracle, R, Salesforce, SAS, SAP (R on HANA) and Teradata (Aprimo).
Data Virtualization (and data federation) is currently led by Composite, Denodo, HP (IDOL), IBM, Informatica, Microsoft, Oracle (Exalytics), SAP and Teiid (JBoss community).

BI Tools Vendors:
Actuate, Dassault Systèmes (Exalead), Domo, Esri, GoodData, Google, HP (Autonomy), IBM (Cognos suite), Information Builders, LogiXML (LogiAnalytics), Microsoft (SQL 2012), Microstrategy, NeutrinoBI, Oracle (OBI Foundation), Panopticon, Panorama, Pentaho, Qlikview, Roambi, SAP (BI4 suite), SAS, SpagoBI, Tableau, TIBCO Spotfire

Action Phase - Data Acquisition providers plus the ERP, CRM and BPM actors:
Adobe, Eloqua, EMC², IBM, iGrafx, Microsoft, OpenText, Oracle, Pega, Progress software, SAP, Salesforce, Software AG, Teradata (Aprimo), Tibco

Data Governance area - Master Data Management (MDM), metadata and data quality tools:
Adaptive, HP, IBM, Informatica, Kalido, Microsoft, Oracle, Orchestra Networks, SAP, SAS, Talend, Tibco

Note that the Complex Event Processing (CEP) Tools are part of the Acquisition (streaming data acquisition), Marshalling (e.g., in-memory storage as data is used or compared immediately) and Analytics (e.g., monitoring functions to detect abnormal activity) streams.
Note that the BI Tools are part of the Analytics (computing Key Performance Indicators) and Action (e.g., creating alerts in a push mode, by mail for instance) streams.
Citrusleaf = Aerospike
Couchbase – roots are in NorthScale – Membase .. CouchDB; two focus audiences – Enterprise & funnel
Analytics Infrastructure = MPP – Distributed open-source, Apache-licensed distribution of Apache Hadoop ... Open source, Massively Parallel Processing (MPP) query engine
Infrastructure as a Service = Cloud IaaS
Operational Infrastructure = Structure of Data – ex JSON; ad-hoc queries; unstructured data; behavioral, redundancy
Not Listed – Hardware / Storage – NetApp, EMC, HP
Per Forbes (per Wikibon), Big Data is an $18 billion industry heading to $50 billion in five years. The companies in the inner circle (ex: MapR, Cloudera, Splunk, Couchbase etc.) are pure-plays within Big Data. One theory is that these inner-circle players will probably get gobbled up by the big boys on the outside, who are just starting to play in the Big Data space (like SAP, Microsoft, Oracle, IBM…). In the meantime, the relative sizes of the circles reflect the relative sizes of the companies in terms of revenue. The percentages reflect the % of their current business that is ‘big data’.
5/18/13 w/ Paul Hofmann
Palantir – just text; just Homeland Security
Oracle Endeca – added
HP Autonomy – added
Attivio (partner with TIBCO) – added
Saffron – Semantec and .. (Risk predictive) – added
0xData – changed logo
MuSigma = Consultant only
Recorded Future = Timeline; Opera = Text-only?; No predictive Analytics?
Kxen – nice company
SAS – Dead? Not scalable
Skytree = a platform / toolbox.. You need to have your own Data Quant to create your own analytics
Sociocast – Saffron Partner
Digital Reasoning – Strong with Dept of Defense too
NoSQL databases currently available include:
• HBase (Apache)
• Cassandra (DataStax)
• MarkLogic (MarkLogic)
• Aerospike (formerly Citrusleaf)
• MongoDB (10gen)
• Accumulo (Apache)
• Riak (Basho)
• CouchDB (CouchBase)
• DynamoDB (Amazon)
• Sqrrl (?)
• VoltDB (?)

http://thinkbiganalytics.com/leading_big_data_technologies/nosql/

NoSQL
NoSQL is an umbrella term for a broad class of database management systems that relax some of the traditional design constraints of relational database management systems (RDBMS) in order to meet goals of more cost-effective scalability, flexible tradeoffs of availability vs. consistency (as described by the CAP theorem), and flexibility for data structures that don’t fit well into the relational model, such as key-value data and large graphs. NoSQL databases typically offer neither ACID transactions nor full SQL dialects.

The NoSQL ecosystem is very large. Among the better known databases are HBase, Cassandra, Aerospike, DynamoDB, MongoDB, Riak, Redis, Accumulo, Datomic, and Couchbase. Of these, HBase and Accumulo are more closely tied to Hadoop than the others, as both use HDFS, by default, for persistent storage and Zookeeper for service federation.

NoSQL databases expose different information models, including key-value records, JSON or XML documents as records, or graph-oriented data. They expose corresponding programmer APIs and sometimes custom query languages that may or may not be SQL-based. However, a recent trend in this industry is the re-introduction of restricted SQL dialects, to support the large user community accustomed to SQL, and improved support for transactions.

As an example of a scenario where a NoSQL database is a good fit, an event log for a web site might be captured in a key-value store, where fast appends and key-based retrievals are required, but not updates or joins.

HBase
HBase is a distributed, column-oriented database, where each cell is versioned (a configurable number of previous values is retained). HBase provides Bigtable-like capabilities on top of Hadoop. SQL queries (but not updates) are supported using Hive, but with high latency. Eventually, Impala will also support Hive queries with lower latency. Like many NoSQL databases, HBase does not support complex transactions, SQL, or ACID transactions. However, HBase offers high read and write performance and is used in several large applications, such as Facebook’s Messaging Platform. By default, HBase uses HDFS for durable storage, but it layers on top of this storage fast record-level queries and updates, which “raw” HDFS doesn’t support. Hence, HBase is useful when fast, record-level queries and updates are required, but storage in HDFS is desired for use with Pig, Hive, or other MapReduce-based tools.

Cassandra
Cassandra is the most popular NoSQL database for very large data sets. It is a key-value, clustered database that uses column-oriented storage, sharding by key ranges, and redundant storage for scalability in both data sizes and read/write performance, as well as resiliency against “hot” nodes and node failures. Cassandra has configurable consistency vs. availability (CAP theorem) tradeoffs, such as a tunable quorum model for writes.

MongoDB
MongoDB is a document-oriented NoSQL database where each record is a JSON document. It has a rich, JavaScript-based query language that exploits the implicit structure of JSON. MongoDB supports sharding for improved scalability and resilience. It is most popular for small to large data sets and less commonly used for very large data sets.

DynamoDB
DynamoDB is Amazon’s highly scalable and available, key-value, NoSQL database. DynamoDB was one of the earliest NoSQL databases, and papers written about it influenced the design of many other NoSQL databases, such as Cassandra.

Couchbase
Couchbase is a key-value NoSQL database that is well-suited for mobile applications where a copy of a data set is resident on many devices, where changes can be performed on any copy, and copies are synchronized when connectivity is available. Think of how an email client works with local copies of your email history and corresponding email servers.

Redis
Redis is a key-value store with specific support for fundamental data structures as values, including strings, hash maps, lists, sets, and sorted sets, whereas most key-value stores have limited understanding of a value’s meaning, except to represent the value as column cells, in many cases. For this reason, Redis is sometimes called a data structure server. Redis keeps all data in memory, which improves performance, but limits the data set sizes it can manage. Durability is optional, by periodic flushing to disk or writing updates to an append log. Master-slave replication is also supported.

Datomic
Datomic is a newer entrant in the NoSQL landscape with a unique data model that remembers the state of the database at all points in the past, making historical reconstruction of events and state trivial. Many standard database operations are supported, including joins and ACID transactions. Deployments are distributed, elastic, and highly available.

Riak
Riak is a fault-tolerant, distributed, key-value NoSQL database designed for large-scale deployments in cloud or hosted environments. A Riak database is masterless, with no single points of failure. It is resilient against the failure of multiple nodes, and nodes can be added or removed easily. Riak is also optimized for read- and write-intensive applications.
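The HBase notes above describe cells that are versioned, with a configurable number of previous values retained. A minimal in-memory sketch of that idea in Python follows. It is not the HBase client API: the class `VersionedTable`, its `put` and `get` methods, and the sample row and column names are hypothetical, and writes are assumed to arrive in increasing timestamp order.

```python
from collections import deque


class VersionedTable:
    """Toy table whose cells keep only the newest `max_versions` values."""

    def __init__(self, max_versions=3):
        self._max_versions = max_versions
        self._cells = {}  # (row, column) -> deque of (timestamp, value), newest first

    def put(self, row, column, value, timestamp):
        # A bounded deque silently drops the oldest version once it is full.
        cell = self._cells.setdefault(
            (row, column), deque(maxlen=self._max_versions)
        )
        cell.appendleft((timestamp, value))

    def get(self, row, column, versions=1):
        # Return up to `versions` values, newest first; empty list if absent.
        cell = self._cells.get((row, column), ())
        return [value for _, value in list(cell)[:versions]]


if __name__ == "__main__":
    table = VersionedTable(max_versions=2)
    table.put("user1", "info:email", "a@example.com", timestamp=1)
    table.put("user1", "info:email", "b@example.com", timestamp=2)
    table.put("user1", "info:email", "c@example.com", timestamp=3)
    # Only the two newest versions survive.
    print(table.get("user1", "info:email", versions=5))
```

A real store would also handle out-of-order timestamps, persistence, and distribution across nodes; the sketch only shows how a per-cell version limit behaves.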