
Big Data

Notes on the Big Data Specialization.

I Big Data Introduction

1 Big Data: Why and Where

40% projected growth in global data generated per year vs 5% growth in global IT spending.

Cloud computing = computing anywhere and any time + dynamic and scalable data analysis

Applications are what make big data valuable.

Big data -> better models -> higher precision

The combination of a growing torrent of data and on-demand (e.g. cloud) computing has launched the data field.

2 Characteristics of Big Data and Dimensions of Scalability

The Four V's of Big Data:

  • Volume: This refers to the vast amount of data that is generated every second, minute, hour, and day in our digitized world.

  • Velocity: This refers to the speed at which data is being generated and the pace at which data moves from one point to the next.

  • Variety: This refers to the ever-increasing different forms that data can come in, e.g., text, images, voice, geospatial.

  • Veracity: This refers to the quality of the data, which can vary greatly.

There are many other V's that get added to these depending on the context.

  • Valence: This refers to how big data can bond with each other, forming connections between otherwise disparate datasets.
  • Value: Processing big data must bring about value from insights gained.

volume

Volume == Size

velocity

Velocity == Speed

speed of creating data + speed of analyzing data

Big data -> real-time processing; late decisions lead to missed opportunities.

Batch Processing:

batch-processing

Real-Time Processing:

real-time-processing
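As a toy illustration (not from the course notes), the difference can be sketched in Python: batch processing waits for the complete dataset before producing an answer, while real-time (stream) processing emits an updated answer per record.

```python
def batch_process(records):
    """Batch: collect the full dataset first, then analyze it in one pass."""
    return sum(records) / len(records)

def stream_process(record_stream):
    """Real-time: update the answer as each record arrives."""
    count, running_sum = 0, 0.0
    for record in record_stream:
        count += 1
        running_sum += record
        yield running_sum / count  # an up-to-date average after every record

data = [3.0, 5.0, 4.0, 8.0]       # invented sample readings
print(batch_process(data))        # one answer, only after all data is in
for avg in stream_process(iter(data)):
    print(avg)                    # a fresh answer per record
```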

variety

Axes of Data Variety

axes-of-data-variety

Think of an email collection:

  • Sender, receiver, date… -> well-structured
  • Body of the email -> text
  • Attachments -> multi-media
  • Who-sends-to-whom -> network
  • A current email can reference a past email, but not a future one -> semantics
  • Real-time? -> availability

veracity

Veracity == Quality

Veracity is very important for making big data operational, because big data can be noisy and uncertain. Data is of no value if it is not accurate; the results of big data analysis are only as good as the data being analyzed. This is often described in analytics as "junk in equals junk out".

data-uncertainty

valence

Valence == Connectedness

valence increases over time

value

3 Data Science: Getting Value out of Big Data

Data science can be thought of as a basis for empirical research where data is used to induce information from observations.

data_science

Insights often refer to the data products of data science.

data_insight

five P's

five P's of data science:

five_P_of_data_science

4 Foundations for Big Data Systems and Programming

...

5 Systems: Getting Started with Hadoop

hdfs

HDFS splits files across nodes for parallel access:

hdfs_split

HDFS is designed for fault tolerance. By default, HDFS maintains three copies of each block.

hdfs_replica
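As a back-of-the-envelope sketch (assuming the Hadoop 2.x default block size of 128 MB, which these notes do not state; it is configurable per cluster), the split-and-replicate behavior can be estimated like this:

```python
import math

BLOCK_SIZE_MB = 128   # assumed Hadoop 2.x default; configurable per cluster
REPLICATION = 3       # HDFS default, as noted above

def hdfs_footprint(file_size_mb):
    """Estimate how a file is split into blocks and replicated across DataNodes."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    stored_mb = file_size_mb * REPLICATION
    return blocks, stored_mb

blocks, stored = hdfs_footprint(500)
print(f"500 MB file -> {blocks} blocks, ~{stored} MB stored in total")
# 500 MB file -> 4 blocks, ~1500 MB stored in total
```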

HDFS is also designed for a variety of file types: text, images, etc.

Two key components of HDFS:

  1. NameNode for metadata: usually one per cluster

    • coordinates operations
    • keeps track of file names, locations in the directory, etc.
    • maintains the mapping of file contents to DataNodes

  2. DataNode for block storage: usually one per machine

    • listens to the NameNode for block creation, deletion, and replication


yarn

YARN - The resource manager for Hadoop

Hadoop 1.0 supports only MapReduce jobs; other kinds of applications are not supported, which leads to poor resource utilization.

Hadoop 2.0: One dataset -> many applications

hadoop2.0

mapreduce

Map and reduce are two concepts taken from functional programming:

  • map = apply operation to all elements
  • reduce = summarize operation on elements
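Both ideas exist as built-ins in Python, which makes for a compact illustration (a sketch of the concepts, not Hadoop code):

```python
from functools import reduce

# map = apply an operation to every element
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))    # [1, 4, 9, 16]

# reduce = summarize all elements into one result
total = reduce(lambda acc, x: acc + x, squares, 0)    # 30

print(squares, total)
```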

map_shuffle_sort_reduce

WordCount

Step 1: Map on each node - map generates key-value pairs

word-count-map

Step 2: Sort and shuffle - pairs with the same key are moved to the same node

word_count_sort_and_shuffle

Step 3: Reduce - add up the values for each key

word_count_reduce
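The whole pipeline can be simulated on a single machine in a few lines of Python (an illustrative sketch, not actual Hadoop code; the sample lines are invented):

```python
from collections import defaultdict

def map_phase(line):
    """Step 1 (map): emit a (word, 1) pair for every word."""
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    """Step 2 (sort and shuffle): group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Step 3 (reduce): add up the values for each key."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["my apple is red", "my grape is green"]   # invented sample input
pairs = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(shuffle_phase(pairs)))
# {'my': 2, 'apple': 1, 'is': 2, 'red': 1, 'grape': 1, 'green': 1}
```

In real Hadoop, the shuffle moves pairs between machines over the network; here a dictionary plays that role.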

MapReduce is bad for:

  • frequently changing data
  • dependent tasks
  • interactive analysis

When to reconsider Hadoop

caution_when_to_hadoop

cloud computing

If you decide to build your own hardware:

  • Hardware estimation is hard -> overestimation or underestimation
  • Software Stacks are complex
  • High Capital Investments: maintenance, procurement, disposal...

cloud-computing-your-team-cloud

cloud service models

IaaS = Get the Hardware only

  • YOU: install and maintain the OS and application software
  • e.g. Amazon EC2

PaaS = Get the Computing Environment

  • YOU: Application Software
  • e.g. Microsoft Azure

SaaS = Get full software on-demand

  • YOU: Domain Goals
  • e.g. Dropbox