Python Data Types and Structures
Python's built-in data types are the foundation of every data science project. Choosing the right structure affects performance, readability, and correctness. This lesson covers every major type you will use daily.
Python Data Type Hierarchy
Numeric Types
Integers
Integers are whole numbers with no decimal point. Python handles arbitrarily large integers without overflow.
# Integers in Python
population = 1_400_000_000 # Underscores for readability
negative = -42
binary = 0b1010 # Binary literal = 10
hex_val = 0xFF # Hex literal = 255
print(type(population)) # <class 'int'>
print(population.bit_length()) # 31
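As a small illustration of the arbitrary-precision claim above, the sketch below computes a value far outside the 64-bit range and contrasts true division with floor division. The variable names are illustrative and not part of the lesson.

```python
# Integers grow as needed; there is no fixed 64-bit ceiling
big = 2 ** 100
print(big)               # 1267650600228229401496703205376
print(big.bit_length())  # 101

# / always returns a float, // returns the floor as an int
true_div = 7 / 2     # 3.5
floor_div = 7 // 2   # 3
neg_floor = -7 // 2  # -4 (floors toward negative infinity)
print(true_div, floor_div, neg_floor)
```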
Floats
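Floats represent real numbers with a fractional part. Python stores them in IEEE 754 double precision (binary64), which gives about 15 to 17 significant decimal digits, so many decimal fractions such as 0.1 cannot be represented exactly.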
# Floats
pi = 3.14159265358979
avogadro = 6.022e23 # Scientific notation
# Beware of floating point precision
print(0.1 + 0.2) # 0.30000000000000004
print(0.1 + 0.2 == 0.3) # False
# For precise decimals (financial), use the decimal module
from decimal import Decimal
a = Decimal("0.1") + Decimal("0.2")
print(a == Decimal("0.3")) # True
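When Decimal is more than you need, a common alternative is to compare floats within a tolerance. The sketch below uses the standard library's math.isclose; the variable names are illustrative.

```python
import math

total = 0.1 + 0.2

# Exact equality fails because of binary rounding error
exact_match = (total == 0.3)            # False
# isclose compares within a relative tolerance (default 1e-09)
close_match = math.isclose(total, 0.3)  # True
print(exact_match, close_match)

# Near zero a relative tolerance alone is too strict, so pass abs_tol
near_zero = math.isclose(1e-12, 0.0, abs_tol=1e-9)  # True
print(near_zero)
```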
Complex Numbers
Complex numbers have a real and an imaginary part, written with a j suffix. They are used in signal processing and physics simulations.
z = 3 + 4j
print(z.real) # 3.0
print(z.imag) # 4.0
print(abs(z)) # 5.0 (magnitude)
Booleans
Booleans are a subclass of integers. True equals 1 and False equals 0.
is_active = True
has_data = False
# Boolean operations
print(is_active and has_data) # False
print(is_active or has_data) # True
print(not is_active) # False
# Truthy and Falsy values
print(bool(0)) # False
print(bool("")) # False
print(bool([])) # False
print(bool(None)) # False
print(bool(42)) # True
print(bool("hello")) # True
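Because True behaves as 1 and False as 0, summing a sequence of comparisons counts how many are true. The sketch below assumes a small, made-up list of scores and a hypothetical pass mark of 60.

```python
scores = [88, 45, 92, 67, 30]  # hypothetical sample data

# Each comparison yields a bool; summing treats True as 1
passed = sum(score >= 60 for score in scores)
pass_rate = passed / len(scores)
print(passed)     # 3
print(pass_rate)  # 0.6

print(isinstance(True, int))  # True
print(True + True)            # 2
```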
Strings
Strings are immutable sequences of Unicode characters. You will use them extensively for text processing in data science.
name = "Data Science"
print(len(name)) # 12
print(name[0]) # 'D'
print(name[-1]) # 'e'
print(name[0:4]) # 'Data'
# Strings are immutable
# name[0] = "d" # TypeError!
# Common methods
text = " Hello, World! "
print(text.strip()) # "Hello, World!"
print(text.lower()) # " hello, world! "
print(text.upper()) # " HELLO, WORLD! "
print(text.replace("World", "Python")) # " Hello, Python! "
print(text.split(",")) # [' Hello', ' World! ']
print("-".join(["a", "b", "c"])) # "a-b-c"
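The methods above are often chained when cleaning text. The following sketch normalizes a made-up list of labels; because strings are immutable, each method returns a new string and leaves the input unchanged. The names raw_labels and accuracy are illustrative.

```python
raw_labels = ["  Cat ", "DOG", " bird  "]  # hypothetical messy input

# Each method returns a new string; the originals are untouched
clean_labels = [label.strip().lower() for label in raw_labels]
print(clean_labels)  # ['cat', 'dog', 'bird']

# join builds one string in a single pass instead of repeated +=
csv_row = ",".join(clean_labels)
print(csv_row)  # cat,dog,bird

# f-strings format values inline
accuracy = 0.91234
report = f"accuracy={accuracy:.2%}"
print(report)  # accuracy=91.23%
```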
Lists
Lists are ordered, mutable sequences. They are the most versatile data structure in Python.
# Creating lists
numbers = [1, 2, 3, 4, 5]
mixed = [1, "hello", 3.14, True, None]
nested = [[1, 2], [3, 4], [5, 6]]
empty = []
# Accessing elements
print(numbers[0]) # 1 (first)
print(numbers[-1]) # 5 (last)
print(numbers[1:3]) # [2, 3] (slice)
# Modifying
numbers.append(6) # Add to end
numbers.insert(0, 0) # Insert at index
numbers.extend([7, 8]) # Add multiple
numbers.remove(3) # Remove first occurrence
popped = numbers.pop() # Remove and return last
del numbers[0] # Delete by index
# List operations
a = [1, 2, 3]
b = [4, 5, 6]
print(a + b) # [1, 2, 3, 4, 5, 6]
print(a * 2) # [1, 2, 3, 1, 2, 3]
print(3 in a) # True
print(len(a)) # 3
# Sorting
nums = [3, 1, 4, 1, 5, 9, 2, 6]
nums.sort() # In-place sort
print(nums) # [1, 1, 2, 3, 4, 5, 6, 9]
sorted_nums = sorted(nums, reverse=True) # New sorted list
print(sorted_nums) # [9, 6, 5, 4, 3, 2, 1, 1]
When to Use Lists
- You need an ordered collection that changes over time.
- You want to append, insert, or remove elements frequently.
- You need duplicate values.
- You want to iterate in insertion order.
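Mutability has one consequence worth seeing in code: assigning a list to a second name does not copy it. The sketch below, with illustrative names, contrasts an alias with a shallow copy.

```python
original = [1, 2, 3]
alias = original           # same list object, not a copy
shallow = original.copy()  # new list with the same elements

alias.append(4)
print(original)  # [1, 2, 3, 4]  (changed through the alias)
print(shallow)   # [1, 2, 3]     (independent)

# A comprehension builds a new list without mutating the source
doubled = [n * 2 for n in original]
print(doubled)   # [2, 4, 6, 8]
```

Note that copy() is shallow: nested lists inside the copy are still shared with the original.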
Tuples
Tuples are ordered, immutable sequences. They are slightly cheaper to create and store than lists, and a tuple whose elements are all hashable can be used as a dictionary key.
# Creating tuples
point = (3, 4)
color = (255, 128, 0)
single = (42,) # Note the trailing comma for single-element tuple
not_a_tuple = (42) # This is just the integer 42
# Accessing (same as lists)
print(point[0]) # 3
print(point[-1]) # 4
# Unpacking
x, y = point
print(f"x={x}, y={y}") # x=3, y=4
# Multiple assignment
a, b, c = 1, 2, 3
# Swap variables
a, b = b, a
# Tuple methods
nums = (1, 2, 2, 3, 3, 3)
print(nums.count(3)) # 3
print(nums.index(2)) # 1
# Tuples as dictionary keys (lists cannot be keys)
location = {(40.7128, -74.0060): "New York", (51.5074, -0.1278): "London"}
When to Use Tuples
- Data should not change (coordinates, RGB colors, database rows).
- You need a hashable type (dictionary keys, set elements).
- Performance matters: tuples are slightly faster than lists.
- You want to enforce immutability as a design constraint.
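Two details behind the points above can be sketched briefly: a tuple is hashable only when all of its elements are, and collections.namedtuple gives tuple fields readable names. The Point type here is a hypothetical example, not something defined in the lesson.

```python
from collections import namedtuple

# A tuple is hashable only if every element is hashable
ok_key = (40.7128, -74.0060)
bad_key = ([40.7128], -74.0060)  # contains a list

try:
    hash(bad_key)
    bad_is_hashable = True
except TypeError:
    bad_is_hashable = False
print(bad_is_hashable)  # False

# namedtuple adds field names while staying immutable
Point = namedtuple("Point", ["x", "y"])  # hypothetical record type
p = Point(3, 4)
print(p.x, p.y)     # 3 4
print(p == (3, 4))  # True
```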
Dictionaries
Dictionaries store key-value pairs. They are the most important data structure for structured data work.
# Creating dictionaries
person = {"name": "Alice", "age": 30, "city": "New York"}
from_keys = dict.fromkeys(["a", "b", "c"], 0) # {'a': 0, 'b': 0, 'c': 0}
empty_dict = {}
# Accessing
print(person["name"]) # "Alice"
print(person.get("salary", 0)) # 0 (default if key missing)
# Modifying
person["age"] = 31 # Update
person["email"] = "a@b.com" # Add new key
del person["city"] # Delete key
# Iterating
for key in person:
    print(key, person[key])
for key, value in person.items():
    print(f"{key}: {value}")
# Useful methods
print(person.keys()) # dict_keys(['name', 'age', 'email'])
print(person.values()) # dict_values(['Alice', 31, 'a@b.com'])
print("name" in person) # True
# Dictionary comprehension
squares = {x: x**2 for x in range(10)}
evens = {x: x**2 for x in range(10) if x % 2 == 0}
Nested Dictionaries
students = {
    "alice": {"age": 22, "grades": [90, 85, 92]},
    "bob": {"age": 23, "grades": [78, 82, 88]},
}
# Access nested values
print(students["alice"]["grades"][0]) # 90
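Indexing a nested dictionary raises KeyError when any level is missing. One possible pattern, sketched below, chains get() with empty defaults and uses a comprehension to summarize each record. The key "carol" is a hypothetical missing entry.

```python
students = {
    "alice": {"age": 22, "grades": [90, 85, 92]},
    "bob": {"age": 23, "grades": [78, 82, 88]},
}

# Chained get() returns a default instead of raising KeyError
carol_grades = students.get("carol", {}).get("grades", [])
print(carol_grades)  # []

# Build a name -> average grade mapping
averages = {
    name: round(sum(info["grades"]) / len(info["grades"]), 1)
    for name, info in students.items()
}
print(averages)  # {'alice': 89.0, 'bob': 82.7}
```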
When to Use Dictionaries
- You need fast lookups by a unique key (O(1) average).
- You represent structured records or JSON-like data.
- You need to map one set of values to another.
- Data science: column-based data, feature dictionaries, configuration.
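As one illustration of mapping keys to values, the sketch below groups made-up (city, temperature) readings by city and then counts observations per city. It relies on collections.defaultdict and collections.Counter from the standard library; the data and names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (city, temperature) observations
readings = [("NYC", 21), ("London", 15), ("NYC", 25), ("London", 17)]

# Group values by key; missing keys start as an empty list
by_city = defaultdict(list)
for city, temp in readings:
    by_city[city].append(temp)

# Derive one summary value per key
mean_temp = {city: sum(t) / len(t) for city, t in by_city.items()}
print(mean_temp)  # {'NYC': 23.0, 'London': 16.0}

# Counter tallies occurrences of each key
counts = Counter(city for city, _ in readings)
print(counts["NYC"])  # 2
```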
Sets
Sets are unordered collections of unique elements. They are optimized for membership testing and set operations.
# Creating sets
fruits = {"apple", "banana", "cherry"}
numbers = set([1, 2, 2, 3, 3, 3]) # {1, 2, 3}
empty_set = set() # NOT {} (that creates a dict)
# Adding and removing
fruits.add("date")
fruits.remove("banana")
fruits.discard("fig") # No error if missing
# Set operations
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(a | b) # Union: {1, 2, 3, 4, 5, 6}
print(a & b) # Intersection: {3, 4}
print(a - b) # Difference: {1, 2}
print(a ^ b) # Symmetric difference: {1, 2, 5, 6}
# Membership testing (very fast)
print(3 in a) # True
# Practical use: finding unique values
df_column = ["cat", "dog", "cat", "bird", "dog", "cat"]
unique_values = set(df_column)
print(unique_values) # {'cat', 'dog', 'bird'} (element order may vary)
When to Use Sets
- You need to remove duplicates quickly.
- You need fast membership testing.
- You need mathematical set operations (union, intersection, difference).
- You are comparing two collections for overlap.
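The overlap and deduplication cases above might look like the following sketch. The ID lists are made up; the last part shows dict.fromkeys as one way to remove duplicates when first-seen order matters, since sets do not preserve order.

```python
train_ids = [101, 102, 103, 104]  # hypothetical ID lists
test_ids = [104, 105, 101, 106]

# Intersection finds IDs present in both collections
overlap = set(train_ids) & set(test_ids)
print(sorted(overlap))  # [101, 104]

no_overlap = set(train_ids).isdisjoint(test_ids)
print(no_overlap)  # False

# dict.fromkeys dedupes while keeping first-seen order
labels = ["cat", "dog", "cat", "bird", "dog"]
unique_in_order = list(dict.fromkeys(labels))
print(unique_in_order)  # ['cat', 'dog', 'bird']
```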
Type Conversions
# Explicit conversion (casting)
int("42") # 42
float("3.14") # 3.14
str(100) # "100"
list("abc") # ['a', 'b', 'c']
tuple([1, 2, 3]) # (1, 2, 3)
set([1, 1, 2]) # {1, 2}
dict([("a", 1), ("b", 2)]) # {'a': 1, 'b': 2}
# Common pitfalls
# print(int("3.14")) # ValueError! Use int(float("3.14")) instead
print(float("3.14")) # 3.14
print(int(3.14)) # 3 (truncates, does not round)
print(int(3.7)) # 3
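Real input often contains values that cannot be converted. The helper below is a minimal sketch of one way to handle this; to_int is a hypothetical name, and returning a default on failure is an assumption about what the caller wants.

```python
def to_int(text, default=None):
    """Convert text such as '42' or '3.14' to an int, or return default."""
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return int(float(text))  # truncates toward zero
    except (ValueError, OverflowError):  # OverflowError covers 'inf'
        return default

print(to_int("42"))    # 42
print(to_int("3.14"))  # 3
print(to_int("-3.7"))  # -3
print(to_int("n/a"))   # None
print(round(3.7))      # 4 (round() rounds; int() truncates)
```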
Type Conversion Rules
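Type conversions are either implicit (Python promotes a value during mixed numeric arithmetic) or explicit (you call a constructor such as int() or float()).
Implicit Promotion Order:
$$\text{bool} \rightarrow \text{int} \rightarrow \text{float} \rightarrow \text{complex}$$
Explicit Casting Rules:
$$\text{int}(x) = \operatorname{trunc}(x), \quad \text{e.g. } \text{int}(3.7) = 3,\ \text{int}(-3.7) = -3$$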
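A short sketch of how Python promotes numeric types in mixed arithmetic, and where it refuses to convert implicitly. The variable names are illustrative.

```python
# Mixed numeric arithmetic promotes to the "wider" type
int_result = True + 1        # bool + int -> int
float_result = 1 + 2.0       # int + float -> float
complex_result = 1.0 + 2j    # float + complex -> complex
print(type(int_result))      # <class 'int'>
print(type(float_result))    # <class 'float'>
print(type(complex_result))  # <class 'complex'>

# Strings and numbers are never mixed implicitly
try:
    "1" + 1
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)  # False

# Explicit conversion resolves it
print(int("1") + 1)  # 2
```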
Choosing the Right Structure
Need ordered + mutable? → List
Need ordered + immutable? → Tuple
Need key-value pairs? → Dict
Need unique values? → Set
Need fast lookup by key? → Dict
Need fast membership testing? → Set
Need to enforce no duplicates? → Set
Performance Comparison
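import time
# Membership testing speed
large_list = list(range(1_000_000))
large_set = set(range(1_000_000))
# perf_counter() is a high-resolution clock meant for measuring short durations
start = time.perf_counter()
999_999 in large_list # Worst case: scans the whole list
list_time = time.perf_counter() - start
start = time.perf_counter()
999_999 in large_set # Single hash lookup
set_time = time.perf_counter() - start
print(f"List: {list_time:.4f}s") # ~0.01s (machine-dependent)
print(f"Set: {set_time:.6f}s") # ~0.000001s (machine-dependent)
# Sets are orders of magnitude faster for membership testing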
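A single timed lookup is noisy. As a hedged alternative, the sketch below uses the standard library's timeit module to repeat each lookup 100 times; the collection size and repeat count are arbitrary choices, and absolute timings will differ by machine.

```python
import timeit

data_list = list(range(100_000))
data_set = set(data_list)
target = 99_999  # worst case for the list: the last element

# timeit calls the function `number` times and returns total seconds
list_time = timeit.timeit(lambda: target in data_list, number=100)
set_time = timeit.timeit(lambda: target in data_set, number=100)

print(f"list: {list_time:.6f}s for 100 lookups")
print(f"set:  {set_time:.6f}s for 100 lookups")
```

The list lookup scans elements one by one (O(n)), while the set lookup hashes the target once (O(1) on average), which is why the gap widens as the collection grows.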
Key Takeaways
- Lists are your default ordered collection; use them when you need mutability.
- Tuples protect data from accidental modification and work as dict keys.
- Dictionaries are essential for structured data and fast lookups.
- Sets are irreplaceable for deduplication and membership testing.
- Always choose the structure that best matches your data's constraints and access patterns.