Unraveling the Mystery of Big Data
Lionel Silberman
May 22, 2014
Copyleft – Share Alike
1
http://www.bigstockphoto.com/image-12115340/stock-photo-binary-stream
Who Am I?
Lionel Silberman, currently the Senior Data Architect at Compuware
• 30 years in Software Development
• Statistical modeling, DBMS, Big Data, Data Architecture, tech/product/management.
• Diverse, deep data management:
– All of the major RDBMS vendors and internals.
– Data modeling and data parallelism techniques.
– OLAP, OLTP, MPP, NoSql systems like Hadoop and Cassandra.
– Scaling and Performance tuning of distributed and federated applications.
• Current interest is integrating products in the enterprise at the data level
that deliver more value than their individual pieces.
• Active interest in big data metadata privacy issues.
• Who are you? What’s your interest in this talk tonight?
2
3
Unraveling the Mystery of Big Data Agenda
4
• What is Big Data?
 Business Value
 Technical Definitions
 Sizes and Applications
• What Big Data is Not (or why isn’t everything just “data”)?
• Architectural Underpinnings
• Some Useful Architectural Distinctions
• Technology stacks and ecosystems
• Data Modeling Example
• Gotchas - What 12 things to watch out for?
• References and more info
• Questions?
Lionel Silberman
What is Big Data?
Business Value
Enabling new products
• Sensors everywhere!
• Nowcasting.
• Ever-narrower segmentation of customers
Analytics - taking data from input through to decision
• Correlation in real time
• New insights from previously hidden data:
• Social
• Geographical data
• Recommendations.
• Finding needles in haystacks.
• In 2010, the industry was worth more than $100 billion,
• growing at almost 10 percent a year;
• about twice as fast as the software business as a whole.
5
What is Big Data?
A Technical Definition
• Data that exceeds the processing capacity of conventional database
systems in volume, velocity or variety*
or the 3Vs!
Volume - Sheer size and growth.
Velocity - how fast it moves.
Variety - structure that is hard to derive, or that changes frequently.
* META Group (now Gartner) analyst Doug Laney
6
What is Big Volume?
• 1970s: Megabytes
• Now: Many organizations are approaching an Exabyte
• Examples:
• Google – capacity for 15 Exabytes
• NSA – capacity for a Yottabyte in Utah
• AWS – 1 Trillion objects in 2012
• Facebook – 500 Terabytes/day
• Scientific Pursuits:
• Large Hadron Collider at CERN - last year 30 Petabytes
• The NASA Center for Climate Simulation - 32 petabytes of climate observations
and simulations on the Discover supercomputing cluster.
• Sloan Digital Sky Survey (SDSS) 140 terabytes. 200 GB per night.
The Large Synoptic Survey Telescope (LSST) is anticipated to acquire 140 TB every five days.
• eBay.com - 90 Petabytes in data warehouses
7
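To keep the unit ladder above straight, here is a minimal conversion helper (a sketch assuming decimal SI units, i.e. 1 TB = 10^12 bytes; the 2,000-day figure is simple arithmetic on the Facebook rate quoted above):

```python
# Decimal (SI) byte units: KB, MB, ..., YB as successive powers of 1000.
UNITS = {"KB": 1, "MB": 2, "GB": 3, "TB": 4, "PB": 5, "EB": 6, "ZB": 7, "YB": 8}

def to_bytes(amount, unit, base=1000):
    """Convert an amount in the given unit to bytes."""
    return amount * base ** UNITS[unit]

# At the quoted 500 TB/day, accumulating one exabyte takes 2,000 days (~5.5 years):
days_to_exabyte = to_bytes(1, "EB") / to_bytes(500, "TB")
```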
What is Big Velocity?
• Financial Trading volume
• Retail – Cyber Monday
(every click and interaction: not just the final sales).
• Government – Affordable Care Act
• Smartphone - geolocated imagery and audio data
• Fraud (complex event processing)
o Credit Card Traffic patterns
o Phone slamming/cramming
• Streaming – Netflix, Snapfish
• Retail - Walmart handles more than 1 million customer
transactions per hour.
• Compuware APM (my firm) 25K transactions per second.
• MMOG – Massively Multiplayer Online Games
http://www.livestream.com/ibmpartnerworld/video?clipId=flv_052b14ea-7d5a-40f5-9e61-e287b0ce5d9c
8
What is Big Variety?
• Diverse sources and destinations:
o Document Backup or Archival – HP, EMC, AWS
o Pictures and Video – Facebook 50 Billion Photos
o Sensor sources – GE, NetApp
o Multi-device - Dropbox and Sugarsync
• Big Data is messy:
o structure aids meaning, but can change frequently.
o multiple sources (e.g. financial feeds, browser incompatibilities)
o Application integration issues (e.g. Fitbit)
o Entity resolution issues (e.g. Portland, dog)
o Visualization increasingly important.
9
Or a 4th or 5th V?
10
Big Data and Visualization - Wikipedia
11
http://infodisiac.com/Wikimedia/Visualizations/
What Big Data is Not (or why isn’t everything just “data”)?
• Traditional systems may not need it:
o Payroll
o Human Resources
o Shop machine sampling?
• Some tradeoffs required for the technology of Big Data:
o Bleeding edge vs. established technology
o Subtle definition of consistency
o Complexity
o New and hard-to-find skills
• Make sure the business case warrants and can tolerate the
tradeoffs…
12
Big Data is Everywhere?
13
Architectural Underpinnings: CAP Theorem
14
High Availability (A)
of data for writes.
Consistency (C)
a single up-to-date copy
of the data.
Partition Tolerance (P)
the system continues to
operate despite arbitrary
message loss or failure of
parts of the system.
(Venn diagram: a system can guarantee at most two of C, A and P; NoSQL systems typically choose A and P, traditional DBMSs C and A.)
NoSql and Eventual Consistency
• Relaxed or weaker consistency to achieve High Availability and
Partition Tolerance.
• Eventual Consistency:
– An unbounded delay in propagating changes across partitions.
– No ordering guarantees at all, thus only lower-level transaction
atomicity.
From a system perspective, this means an operational system is
ALWAYS in an inconsistent state somewhere.
• Many NoSql systems (e.g. Cassandra, Hadoop):
– support self-healing or restartability.
– allow easy scaling and Disaster Recovery.
– are schema-free.
– have no standard way of retrieving data (no equivalent of SQL).
15
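Eventual consistency can be illustrated with a toy last-writer-wins replica (a sketch of the general idea, not any particular system's protocol; class and key names are invented for this example):

```python
class Replica:
    """Toy last-writer-wins replica: converges when versions are exchanged."""
    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        current = self.store.get(key)
        if current is None or ts > current[0]:
            self.store[key] = (ts, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

    def merge(self, other):
        # Anti-entropy: keep the newer version of every key.
        for key, (ts, value) in other.store.items():
            self.write(key, value, ts)

a, b = Replica(), Replica()
a.write("user:1", "Alice", ts=1)
b.write("user:1", "Alicia", ts=2)   # concurrent write on another partition
assert a.read("user:1") != b.read("user:1")   # inconsistent until they talk
a.merge(b); b.merge(a)                         # replicas eventually exchange state
assert a.read("user:1") == b.read("user:1") == "Alicia"
```

Note the window before `merge` where the two replicas disagree: that window is exactly the "unbounded delay" the slide describes.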
Business and Infrastructure Architecture Decisions
16
Infrastructure:
• Amazon Web Services (AWS)
• Storage
• Elasticity
• Availability
• Data division:
• Parallelism (sharding)
• Redundancy
• Application Servers
• Data affinity
• Stateless protocols
Business Issues:
• Research to Production Pipeline?
• 3rd party integration needs?
• Flexibility?
• Radical Transparency?
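The "data division" bullets (parallelism via sharding, plus redundancy) can be sketched with a minimal hash-sharding function (illustrative only; function names and replica-placement policy are invented for this example):

```python
import hashlib

def shard_for(key, num_shards):
    """Map a key deterministically to one of num_shards partitions."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def replica_shards(key, num_shards, replicas=2):
    """Redundancy: place copies on successive shards after the primary."""
    primary = shard_for(key, num_shards)
    return [(primary + i) % num_shards for i in range(replicas)]
```

Because `shard_for` is deterministic, any node can route a read or write for a key without a central lookup, which is what makes the sharded layout parallel.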
Data Architecture Decisions
17
Data In-transit:
• high write vs. high reads
• Encryption and security
• Distributed vs. centralized
• Visualization tools
Data Stores:
• Documents
• key/value pairs
• Graphs
• In-memory
Technology Stacks (Data Stores)
• Hadoop: Distributed File System, Job Scheduler, MapReduce programming model
– Pros: fault-tolerant, disaster protection, data parallelism fits many applications, ecosystem.
– Best used: long-term data storage, research, basis of other flexible data stores.
• Cassandra: Big Table, key-value
– Pros: Fast writes, no single point of failure, fault-tolerant, disaster recovery, columns and column families.
– Best used: when you write more than you read, Financial industry, real-time data analysis.
• Riak: key-value
– Pros: Cassandra-like, less complex, single-site scalability, availability & fault-tolerance
– Best used: Point-of-sales, factory control systems, high writes.
• Redis: in-memory, key-value
– Pros: fast, transactional, expiring values.
– Best used: Rapidly changing data in memory (stock prices, analytics, real-time data collection).
• Dynamo: Big Table, key-value
– Pros: Fast reads and writes, no single point of failure, fault-tolerant, disaster recovery, eventually consistent
– Best used: Always available (e.g. Amazon).
• CouchDB: Documents
– Pros: bi-directional replication, conflict detection, previous versions
– Best used: Accumulating occasionally changing data, pre-defined queries, versioning
• MongoDB: document store.
– Pros: update-in-place, defined indexes, built-in sharding, geospatial indexing.
– Best used: Dynamic queries over schema-less data that changes a lot.
• HBase: Big Table
– Pros: huge datasets, map-reduce Hadoop/HDFS stack.
– Best used: Analyzing log data
• Memcached/Membase: in-memory and multi-node.
– Pros: low-latency, high-concurrency and availability.
– Best used: Zynga – online gaming.
• Neo4j: GraphDB
– Pros: highly scalable, robust, ACID
– Best used: social, routing, recommendation questions (e.g. How do I get to Linz?)
18
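The MapReduce programming model mentioned for Hadoop above can be illustrated in-process (a sketch of the model only, not Hadoop's API): a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(doc):
    """Map: emit (word, 1) for every word in a document."""
    for word in doc.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values (here, sum the counts)."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big value", "big velocity"]
pairs = [p for doc in docs for p in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
# counts == {"big": 3, "data": 1, "value": 1, "velocity": 1}
```

In a real Hadoop cluster the map and reduce calls run on many machines in parallel and the shuffle moves data across the network, but the programmer's contract is exactly this shape.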
Related Data Management Ecosystems
• Apache Hadoop stack (Cloudera)
– MapReduce on HDFS
– Pig, Hive, HBase – SQL-like DB interfaces on top of HDFS.
– Flume, RabbitMQ – Data conduits and message queues.
– Splunk – Operational Analytics and Log Processing.
– Sqoop – Bulk Data transfer to DBs
– Puppet, Chef – Configuration Management and DevOps Orchestration.
– Visualization, BI and ETL - Informatica, Talend, Pentaho, Tableau
• Cloud computing infrastructure (Amazon Web Services)
e.g. EC2, Elastic MapReduce, RDS
• Cassandra (Datastax)
• High-scale, distributed and hybrid RDBMS:
- Teradata
- Netezza
- EMC/Greenplum
- Aster Data
- Vertica
- VoltDB
- RDF Triple Stores
- Hadapt
19
Data Modeling Example: Twitter Publishers and Subscribers
20
• Relational DB: One table that has people relationships and tags
whether a publisher or subscriber.
Pros: No duplicated data, ACID transaction
Cons: Does not scale out; single point of failure (SPOF)
• NoSQL: Separate Indices for Subscribers and Publishers.
Pros:
– Partition Independence enables scale-out and no single point of failure for
both reads and writes.
– No Schema allows quick development.
Cons:
– Eventual consistency requires care in application layer and presentation of
user experience, and future evolution.
– Redundant storage.
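The two designs above can be sketched side by side (plain Python stand-ins, not any specific database's API; names are invented for this example). The NoSQL variant writes each relationship twice, once per index, so a read in either direction stays local to one partition:

```python
# Relational-style: one edges table; both questions query the same table.
edges = [("alice", "bob"), ("carol", "bob")]   # (publisher, subscriber) rows

def subscribers_of(publisher):
    return [s for p, s in edges if p == publisher]

# NoSQL-style: two separate indices, each of which can live on its own
# partition. Redundant storage, but no shared table to contend on.
by_publisher = {}
by_subscriber = {}

def follow(publisher, subscriber):
    by_publisher.setdefault(publisher, set()).add(subscriber)
    by_subscriber.setdefault(subscriber, set()).add(publisher)

follow("alice", "bob")
follow("carol", "bob")
```

The cost shows up on writes: `follow` must update both indices, and under eventual consistency the two may briefly disagree, which is exactly the application-layer care the cons list calls out.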
Gotchas - What 12 Things to Watch Out for?
1. Privacy
2. Abuse – e.g. HFT/Front running.
3. Immature technologies and companies.
4. Effect of business and product changes on architecture.
5. Data in-transit vs. at-rest - replication, mirroring, streaming,
reprocessing.
6. Data security in-transit and at-rest.
7. Blurring of high availability, performance and disaster recovery.
8. Replacing sampling and aggregation with ALL of the data!
9. Correlation is not Causation – e.g. Google Flu
10. Data Snooping (or Confirmation Bias) http://tylervigen.com
11. Irrelevance
12. Veracity - how do you check and reproduce results?
“The process of making is iterative” - Cesar A. Hidalgo
21
References and More Info
• http://en.wikipedia.org/wiki/Big_data
• http://highscalability.com/blog/2012/9/11/how-big-is-a-petabyte-exabyte-zettabyte-or-a-yottabyte.html
• http://strata.oreilly.com/2012/01/what-is-big-data.html
• http://googlesystem.blogspot.com/2006/09/how-much-data-does-google-store.html
• http://www.cnet.com/news/nsa-to-store-yottabytes-in-utah-data-centre/
• http://aws.typepad.com/aws/2012/06/amazon-s3-the-first-trillion-objects.html
• http://gigaom.com/2012/08/22/facebook-is-collecting-your-data-500-terabytes-a-day/
• http://iveybusinessjournal.com/topics/strategy/why-big-data-is-the-new-competitive-
advantage#.U12f8aPD9jo
• http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
• Animation: Large Hadron Collider at CERN: http://home.web.cern.ch/about/updates/2013/04/animation-
shows-lhc-data-processing
• http://ivoroshilin.com/2012/12/13/brewers-cap-theorem-explained-base-versus-acid/
• http://www.scientificamerican.com/article/saving-big-data-from-big-
mouths/?utm_content=buffer53ae6&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
• http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz31AZrOtI6
• Copyleft – Share Alike - http://creativecommons.org/licenses/by-sa/3.0/
22
Questions? Use Cases? Technology Adoption?
Feedback or follow-up: Lionel.Silberman@Compuware.com
23
