
Big Data

by admin last modified Feb 22, 2017 05:35 AM

Copying/Replicating Big Data over the Internet

The project leader is Andrey Y Shevel

Big Data is the technical term for data volumes so large that copying or transferring them over a network is no longer straightforward.

Naturally, what counts as Big Data depends on when the question is asked. In 2013, for example, the term applies to volumes from roughly hundreds of terabytes (TB) upward.

What are the sources of Big Data? Many areas of human activity: experimental data from High Energy Physics, Astrophysics, Genetics, Biology, and so on. For example, you can take a look at the page [Information Revolution: Big Data Has Arrived at an Almost Unimaginable Scale].

For now we plan to undertake a comparative study of existing systems for copying/replicating Big Data over the Internet, and to develop a set of criteria for comparing such systems.

Later on we plan to propose an architecture of a copy/replicate system for Big Data for a range of application areas: communications, scientific physics experiments, etc.

A range of related topics

    • eduPERT Knowledge Base

Some references for Big Data transfer

    • bbFTP – multi-stream file transfer software developed at IN2P3, whose features include:

      • Encoded username and password at connection

      • SSH and Certificate authentication modules

      • Multi-stream transfer

      • Big windows as defined in RFC1323

      • On-the-fly data compression

      • Automatic retry

      • Customizable time-outs

      • Transfer simulation

      • AFS authentication integration

      • RFIO interface

    • BBCP – utility to transfer data over the network (Andrew Hanushevsky)

    • ESnet Fasterdata Knowledge Base

    • PhEDEx – CMS data placement and transfer system

    • Berkeley Storage Manager (BeStMan)

    • FTS3 - file transfer service

    • GridFTP - Grid/Globus data transfer tool. The client part is known as globus-url-copy.

    • Efficient Data Transfer Protocols for Big Data - Brian Tierney, Ezra Kissel, Martin Swany, Eric Pouyoul

      • Lawrence Berkeley National Laboratory, Berkeley, CA 94720

      • School of Informatics and Computing, Indiana University, Bloomington, IN 47405

      • Abstract—Data set sizes are growing exponentially, so it is important to use data movement protocols that are the most efficient available. Most data movement tools today rely on TCP over sockets, which limits flows to around 20Gbps on today’s hardware. RDMA over Converged Ethernet (RoCE) is a promising new technology for high-performance network data movement with minimal CPU impact over circuit-based infrastructures. We compare the performance of TCP, UDP, UDT, and RoCE over high latency 10Gbps and 40Gbps network paths, and show that RoCE-based data transfers can fill a 40Gbps path using much less CPU than other protocols. We also show that the Linux zero-copy system calls can improve TCP performance considerably, especially on current Intel “Sandy Bridge”-based PCI Express 3.0 (Gen3) hosts.

    • Big Data @ NIST.GOV
    • Big Data @ San Diego Supercomputer Center
    • Big Data @ HP
    • Big Data @ IBM
    • Big Data @ Oracle
    • Picture of Big Data from the IBM point of view
    • Various features of Big Data
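Several of the tools referenced above advertise multi-stream transfer combined with large TCP windows (RFC 1323). The following toy sketch illustrates the technique over loopback; the function names and the 4-byte range-request "protocol" are purely illustrative, not the wire format of any actual tool:

```python
import socket
import struct
import threading

def recv_exact(conn, n):
    """Read exactly n bytes from a socket."""
    buf = b""
    while len(buf) < n:
        block = conn.recv(n - len(buf))
        if not block:
            raise ConnectionError("stream closed early")
        buf += block
    return buf

def multi_stream_copy(data, n_streams=4, bufsize=1 << 20):
    """Split `data` into n_streams byte ranges, move each range over its
    own TCP connection, and reassemble the ranges in offset order."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(n_streams)
    port = srv.getsockname()[1]
    chunk = -(-len(data) // n_streams)  # ceiling division

    def sender():
        # Serve each incoming stream the byte range it asks for.
        for _ in range(n_streams):
            conn, _ = srv.accept()
            with conn:
                idx = struct.unpack("!I", recv_exact(conn, 4))[0]
                conn.sendall(data[idx * chunk:(idx + 1) * chunk])

    srv_thread = threading.Thread(target=sender)
    srv_thread.start()
    parts = [b""] * n_streams

    def receiver(idx):
        s = socket.socket()
        # Buffers well above 64 KB push the kernel to negotiate the
        # RFC 1323 window-scaling option ("big windows" above).
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bufsize)
        s.connect(("127.0.0.1", port))
        s.sendall(struct.pack("!I", idx))  # announce which range we want
        buf = bytearray()
        while True:
            block = s.recv(65536)
            if not block:
                break
            buf.extend(block)
        s.close()
        parts[idx] = bytes(buf)

    threads = [threading.Thread(target=receiver, args=(i,))
               for i in range(n_streams)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    srv_thread.join()
    srv.close()
    return b"".join(parts)
```

On long fat paths, the point of multiple streams is that several congestion windows grow (and recover from loss) independently, so the aggregate throughput is less sensitive to any single stream stalling.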
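The feature list above also mentions automatic retry and customizable time-outs. A minimal sketch of that pattern (the function name and parameters are illustrative assumptions, not any tool's API) could look like:

```python
import time

def transfer_with_retry(do_transfer, retries=3, timeout=5.0, backoff=1.0):
    """Run `do_transfer(timeout)` up to `retries` times, doubling the
    pause between attempts; re-raise the last error if all attempts fail."""
    delay = backoff
    last_exc = None
    for attempt in range(retries):
        try:
            return do_transfer(timeout)
        except OSError as exc:  # covers socket.timeout and network errors
            last_exc = exc
            if attempt + 1 < retries:
                time.sleep(delay)
                delay *= 2  # exponential backoff between attempts
    raise last_exc
```

A caller would wrap a single transfer attempt (which raises `OSError` on failure) in `do_transfer` and let the wrapper handle transient link outages.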
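The Tierney et al. abstract above notes that Linux zero-copy system calls can improve TCP performance considerably. A minimal sketch of that idea using Python's `os.sendfile` (available on Linux and macOS); the demo harness around it is illustrative:

```python
import os
import socket
import tempfile
import threading

def zero_copy_send(sock, path):
    """Transmit a file with os.sendfile(): the kernel moves pages straight
    from the page cache to the socket, skipping the user-space copy that a
    read()/send() loop would perform."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            sent = os.sendfile(sock.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:  # peer closed the connection early
                break
            offset += sent
    return offset

def loopback_demo(payload):
    """Round-trip `payload` through a loopback socket via sendfile."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    received = bytearray()

    def reader():
        conn, _ = srv.accept()
        with conn:
            while True:
                block = conn.recv(65536)
                if not block:
                    break
                received.extend(block)

    t = threading.Thread(target=reader)
    t.start()
    cli = socket.socket()
    cli.connect(srv.getsockname())
    zero_copy_send(cli, path)
    cli.close()
    t.join()
    srv.close()
    os.unlink(path)
    return bytes(received)
```

The saving is in CPU rather than in protocol behavior: with sendfile the data never crosses into user space, which matters once a single host is expected to fill a 10-40 Gbps path.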